

## Programming for Intel® Xeon Phi™

Stephen Blair-Chappell







Optimization Notice

8/2/2012

## Code must be

- highly parallel
- effectively vectorised

#### **Application Performance: Intel® Xeon Phi™ Coprocessor**



For more information go to http://www.intel.com/performance

Copyright© 2012, Intel Corporation. All rights reserved. \*Other brands and names are the property of their respective owners.





# Is the Intel® Xeon Phi<sup>™</sup> Coprocessor right for me?




## How many threads?

"An application must scale well past one hundred threads to qualify as highly parallel"



Jim Jeffers and James Reinders. ISBN: 978-0124104143





## **Parallel Performance Potential**



If your performance needs are met by an Intel® Xeon® processor, they will be achieved with fewer threads than on a coprocessor

#### On a coprocessor:

- Need more threads to achieve same performance
- Same thread count can yield less performance

#### Intel Xeon Phi excels on highly parallel applications










## Vectorisation is ...

#### Faster Code

| Year      | Instruction set | Highlights                                                                                                   |
|-----------|-----------------|--------------------------------------------------------------------------------------------------------------|
| 1999      | SSE             | 70 instructions; single-precision vectors; streaming operations                                              |
| 2000      | SSE2            | 144 instructions; double-precision vectors; 8/16/32/64/128-bit vector integer                                |
| 2004      | SSE3            | 13 instructions; complex data                                                                                |
| 2006      | SSSE3           | 32 instructions; decode                                                                                      |
| 2007      | SSE4.1          | 47 instructions; video, graphics building blocks; advanced vector instructions                               |
| 2008      | SSE4.2          | 8 instructions; string/XML processing; POP-Count; CRC                                                        |
| 2009      | AES-NI          | 7 instructions; encryption and decryption; key generation                                                    |
| 2011      | AVX             | ~100 new instructions; ~300 legacy SSE instructions updated; 256-bit vector; 3- and 4-operand instructions   |
| 2012/2013 | AVX2            | Integer AVX expands to 256 bit; improved bit manipulation; FMA; vector shifts; gather                        |
| 2012      | MIC             | 512-bit vector                                                                                               |








#### Key Differentiators Xeon Phi vs Workstation

- More cores
- Slower clock speed
- Wider SIMD registers
- Higher memory bandwidth
- In-order pipeline



Intel Confidential

#### **Theoretical Peak Flops Performance Example**

Frequency \* Num Sockets \* Num Cores \* Vector Width \* FP Ops

Two socket Intel® Xeon® E5-2670 Processor

| Freq (GHz) | Sockets | Num Cores | Vector Width | FP Ops | GFlops |
|------------|---------|-----------|--------------|--------|--------|
| 2.6        | 2       | 8         | 8            | 2      | 666    |

Single card Xeon Phi Coprocessor (B0)

| Freq (GHz) | Sockets | Num Cores | Vector Width | FP Ops        | GFlops |
|------------|---------|-----------|--------------|---------------|--------|
| 1.091      | 1       | 61        | 16           | 2 (using FMA) | 2,128  |
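The peak-flops formula above can be checked with a few lines of Python. This is only a sketch: the function name `peak_gflops` is made up for illustration, and the Xeon row assumes 8 single-precision AVX lanes, which is the width that reproduces the 666 GFlops figure.

```python
def peak_gflops(freq_ghz, sockets, cores, vector_width, fp_ops):
    """Theoretical peak = frequency * sockets * cores * vector width * FP ops."""
    return freq_ghz * sockets * cores * vector_width * fp_ops


# Two-socket Xeon E5-2670: 8 cores per socket, 8 single-precision lanes
# (assumed), 2 FP ops per cycle.
xeon = peak_gflops(2.6, 2, 8, 8, 2)

# Single Xeon Phi card: 61 cores, 16 single-precision lanes,
# 2 FP ops per cycle when FMA is used.
phi = peak_gflops(1.091, 1, 61, 16, 2)

print(f"Xeon peak: {xeon:.1f} GFlops")
print(f"Phi peak:  {phi:.1f} GFlops")
print(f"Phi / Xeon ratio: {phi / xeon:.2f}")
```

The Phi result comes out slightly above the 2,128 shown in the table; that figure is what the formula gives if the frequency is rounded to 1.09 GHz.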


#### Synthetic Benchmark Summary (Intel® MKL)



Coprocessor results: Benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native)



#### Intel® Xeon Phi<sup>™</sup> Coprocessor: Increases Application Performance up to 10x

| Segment                           | Customer                                   | Application                                           | Performance Increase <sup>1</sup><br>vs. 2S Xeon* |
|-----------------------------------|--------------------------------------------|-------------------------------------------------------|---------------------------------------------------|
|                                   | Acceleware                                 | 8 <sup>th</sup> order isotropic<br>variable velocity  | Up to 2.23x                                       |
| Energy                            | Sinopec                                    | Seismic Imaging                                       | Up to 2.53x <sup>2</sup>                          |
|                                   | CNPC<br>(China Oil & Gas)                  | GeoEast Pre-Stack Time<br>Migration (Seismic)         | Up to 3.54x <sup>2</sup>                          |
| Financial Services                | Financial Services                         | BlackScholes SP<br>Monte Carlo SP                     | Up to 7.5x<br>Up to 10.75x                        |
| Physics                           | Jefferson Labs                             | Lattice QCD                                           | Up to 2.79x                                       |
| Finite Element                    | Sandia Labs                                | miniFE<br>(Finite Element Solver)                     | Up to 2x <sup>3</sup><br>Up to 1.3x <sup>5</sup>  |
| Solid State<br>Physics            | ZIB<br>(Zuse-Institut Berlin)              | Ising 3D<br>(Solid State Physics)                     | Up to 3.46x                                       |
| Digital Content<br>Creation/Video | Intel Labs                                 | Ray Tracing<br>(incoherent rays)                      | Up to 1.88x <sup>4</sup>                          |
|                                   | NEC                                        | Video Transcoding                                     | Up to 3.0x <sup>2</sup>                           |
| Astronomy                         | CSIRO/ASKAP<br>(Australia Astronomy)       | tHogbom Clean<br>(Astronomy image smear<br>removal)   | Up to 2.27x                                       |
|                                   | TUM (Technische<br>Universität München)    | SG++ (Astronomy Adaptive<br>Sparse Grids/Data Mining) | Up to 1.7x                                        |
| Fluid Dynamics                    | AWE (Atomic Weapons<br>Establishment - UK) | Cloverleaf<br>(2D Structured Hydrodynamics)           | 1.77x                                             |

#### Notes:

1. 2S Xeon\* vs. 1 Xeon Phi\* (preproduction HW/SW & application running 100% on coprocessor unless otherwise noted)
2. 2S Xeon\* vs. 2S Xeon\* + 2 Xeon Phi\* (offload)
3. 8 node cluster, each node with 2S Xeon\* (comparison is cluster performance with and without 1 Xeon Phi\* per node) (Hetero)
4. Intel measured, October 2012
5. 8 node cluster, each node with 2S Xeon\* (comparison is cluster performance with Xeon only vs. Xeon Phi\* only, 1 Xeon Phi\* per node) (Native)

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Source: Customer-measured results as of October 22, 2012. Configuration details: please refer to the slide speaker notes.






### A Tale of Two Architectures

|                  | Intel® Xeon® processor | Intel® Xeon Phi™ Coprocessor |
|------------------|------------------------|------------------------------|
| Sockets          | 2                      | 1                            |
| Clock Speed      | 2.6 GHz                | 1.1 GHz                      |
| Execution Style  | Out-of-order           | In-order                     |
| Cores/socket     | 8                      | Up to 61                     |
| HW Threads/Core  | 2                      | 4                            |
| Thread switching | HyperThreading         | Round Robin                  |
| SIMD widths      | 8 SP, 4 DP             | 16 SP, 8 DP                  |
| Peak GFlops      | 692 SP, 346 DP         | 2020 SP, 1010 DP             |
| Memory Bandwidth | 102 GB/s               | 320 GB/s                     |
| L1 DCache/Core   | 32 kB                  | 32 kB                        |
| L2 Cache/Core    | 256 kB                 | 512 kB                       |
| L3 Cache/Socket  | 30 MB                  | none                         |



# Your code will benefit from running on Xeon Phi if ...

- It is highly scalable
- It is effectively vectorised *or* bandwidth constrained




## Three things to consider

- P: the parallel part of the program
- S: the serial part of the program
- B: the bandwidth constrained part of the program









## **Compute bound**




## The Parallel, Vector and Clock Factors

Parallel Factor = Num Xeon Cores / Num Phi Cores = 16 / 61 = 0.26229

Vector Factor = (Xeon Vector Length \* Xeon Instruction Level Parallelism) / (Phi Vector Length \* Phi Instruction Level Parallelism)

| SIMD        | Vector Factor               |
|-------------|-----------------------------|
| AVX-FMA\*\* | (4 \* 2) / (8 \* 2) = 0.5   |
| AVX-non-FMA | (4 \* 2) / (8 \* 1) = 1     |
| SSE-FMA\*\* | (2 \* 2) / (8 \* 2) = 0.25  |
| SSE-non-FMA | (2 \* 2) / (8 \* 1) = 0.5   |

Clock Factor = Xeon Frequency / Phi Frequency = 3.1 / 1.09 = 2.844

#### Combined = Parallel Factor \* Vector Factor \* Clock Factor

| SIMD        | Combined                            |
|-------------|-------------------------------------|
| AVX-FMA\*\* | 0.26229 \* 0.5 \* 2.844 = 0.373     |
| AVX-non-FMA | 0.26229 \* 1 \* 2.844 = 0.746       |
| SSE-FMA\*\* | 0.26229 \* 0.25 \* 2.844 = 0.187    |
| SSE-non-FMA | 0.26229 \* 0.5 \* 2.844 = 0.373     |

*NB: we are comparing a 2 socket SNB with a single coprocessor (64 bit floating point doubles).*

\*\* FMA: source code is capable of using FMA when built for Xeon Phi

- FMA\*\*: x5.38 faster
- FMA\*\*: x2.68 faster
- Non-FMA: x2.68 faster
- Non-FMA: x1.54 faster
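The factor arithmetic on this slide can be reproduced with a short Python sketch. The function names and the `CASES` table are illustrative only. The inputs are the slide's own: 16 Xeon cores against 61 Phi cores, 3.1 GHz against 1.09 GHz, and vector lengths counted in 64-bit doubles.

```python
def parallel_factor(xeon_cores, phi_cores):
    """Ratio of host cores to coprocessor cores."""
    return xeon_cores / phi_cores


def vector_factor(xeon_vlen, xeon_ilp, phi_vlen, phi_ilp):
    """(Xeon vector length * Xeon ILP) / (Phi vector length * Phi ILP)."""
    return (xeon_vlen * xeon_ilp) / (phi_vlen * phi_ilp)


def clock_factor(xeon_ghz, phi_ghz):
    """Ratio of host clock to coprocessor clock."""
    return xeon_ghz / phi_ghz


# Per case: (Xeon vector length in doubles, Phi ILP). Xeon ILP is 2 and the
# Phi vector length is 8 doubles in every case on the slide.
CASES = {
    "AVX-FMA": (4, 2),
    "AVX-non-FMA": (4, 1),
    "SSE-FMA": (2, 2),
    "SSE-non-FMA": (2, 1),
}


def combined_factors(xeon_cores=16, phi_cores=61, xeon_ghz=3.1, phi_ghz=1.09):
    """Combined = parallel factor * vector factor * clock factor, per case."""
    base = parallel_factor(xeon_cores, phi_cores) * clock_factor(xeon_ghz, phi_ghz)
    return {
        name: base * vector_factor(xeon_vlen, 2, 8, phi_ilp)
        for name, (xeon_vlen, phi_ilp) in CASES.items()
    }


factors = combined_factors()
for name, value in factors.items():
    print(f"{name:12s} combined = {value:.3f}  reciprocal = {1 / value:.2f}")
```

A combined factor below 1 means the Xeon Phi is expected to be faster, and the reciprocal is the estimated speedup. The reciprocals computed this way are close to, but not in every case identical with, the "faster" figures printed on the slide.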


## **The Serial Factor**



Serial Factor = Clock Factor \* ILP Factor \* Issue Factor

Where:

- Clock Factor = 2.6 / 1.09
- ILP Factor = 2/2 = 1 for FMA\*\* type calculations
- ILP Factor = 2/1 for non-FMA type calculations
- Issue Factor = Num cycles to issue instruction on Phi / Num cycles to issue instruction on Xeon = 2/1

Note: in single threaded code Xeon Phi uses two cycles to issue an instruction (in threaded mode it takes just one cycle).

\*\* FMA: source code is capable of using Fused Multiply-Add when built for Xeon Phi

Non-FMA: x9.54 slower
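As a rough check of the serial factor, the definition above can be evaluated directly. The sketch below uses a hypothetical helper `serial_factor` and takes the slide's values as inputs: a 2.6 GHz Xeon, a 1.09 GHz Xeon Phi, and an issue factor of 2/1.

```python
def serial_factor(xeon_ghz, phi_ghz, ilp_factor, issue_factor):
    """Serial factor = clock factor * ILP factor * issue factor."""
    return (xeon_ghz / phi_ghz) * ilp_factor * issue_factor


# Single-threaded Phi code needs 2 cycles to issue an instruction, Xeon needs 1.
ISSUE_FACTOR = 2 / 1

fma = serial_factor(2.6, 1.09, 2 / 2, ISSUE_FACTOR)      # FMA-type code
non_fma = serial_factor(2.6, 1.09, 2 / 1, ISSUE_FACTOR)  # non-FMA code

print(f"FMA:     x{fma:.2f} slower on Xeon Phi")
print(f"Non-FMA: x{non_fma:.2f} slower on Xeon Phi")
```

The non-FMA value agrees with the x9.54 slowdown quoted on the slide. The FMA case comes out at roughly half of that because its ILP factor is 1 rather than 2.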








## Factors (2.6 GHz Clock)

| Host                            | SIMD | Serial | Vector | Parallel | Clock |
|---------------------------------|------|--------|--------|----------|-------|
| Single socket, 2.6 GHz, FMA\*\* | AVX  | 4.772  | 0.5    | 0.1333   | 2.386 |
| Single socket, 2.6 GHz, FMA\*\* | SSE2 | 4.772  | 0.25   | 0.1333   | 2.386 |
| Single socket, 2.6 GHz, no FMA  | AVX  | 9.544  | 1      | 0.1333   | 2.386 |
| Single socket, 2.6 GHz, no FMA  | SSE2 | 9.544  | 0.5    | 0.1333   | 2.386 |
| Twin socket, 2.6 GHz, FMA\*\*   | AVX  | 4.772  | 0.5    | 0.2666   | 2.386 |
| Twin socket, 2.6 GHz, FMA\*\*   | SSE2 | 4.772  | 0.25   | 0.2666   | 2.386 |
| Twin socket, 2.6 GHz, no FMA    | AVX  | 9.544  | 1      | 0.2666   | 2.386 |
| Twin socket, 2.6 GHz, no FMA    | SSE2 | 9.544  | 0.5    | 0.2666   | 2.386 |

Xeon: 8 cores per socket. Phi: using 60 of 61 cores.

\*\* FMA: source code is capable of using FMA when built for Xeon Phi.

NOTE: Serial Factor already includes the Clock Factor.






## Factors (3.1 GHz Clock)

| Host                            | SIMD | Serial | Vector | Parallel | Clock |
|---------------------------------|------|--------|--------|----------|-------|
| Single socket, 3.1 GHz, FMA\*\* | AVX  | 5.69   | 0.5    | 0.1333   | 2.844 |
| Single socket, 3.1 GHz, FMA\*\* | SSE2 | 5.69   | 0.25   | 0.1333   | 2.844 |
| Single socket, 3.1 GHz, no FMA  | AVX  | 11.38  | 1      | 0.1333   | 2.844 |
| Single socket, 3.1 GHz, no FMA  | SSE2 | 11.38  | 0.5    | 0.1333   | 2.844 |
| Twin socket, 3.1 GHz, FMA\*\*   | AVX  | 5.69   | 0.5    | 0.2666   | 2.844 |
| Twin socket, 3.1 GHz, FMA\*\*   | SSE2 | 5.69   | 0.25   | 0.2666   | 2.844 |
| Twin socket, 3.1 GHz, no FMA    | AVX  | 11.38  | 1      | 0.2666   | 2.844 |
| Twin socket, 3.1 GHz, no FMA    | SSE2 | 11.38  | 0.5    | 0.2666   | 2.844 |

Xeon: 8 cores per socket. Phi: using 60 of 61 cores.

\*\* FMA: source code is capable of using FMA when built for Xeon Phi.

NOTE: Serial Factor already includes the Clock Factor.






#### 'Finger in the air' speedups (from 2 socket 2.6 GHz SSE2)

- An application that is highly parallel and effectively vectorised will speed up by x2.5
- An application that is highly parallel but not vectorised will speed up by x1.3
- An application that is not parallel but is vectorised will slow down by x1.5
- A serial application will slow down by x12.0
- A bandwidth constrained application will speed up by x2.4

These are only 'back of the envelope' figures; what you experience in practice may differ.
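One way to apply these rough figures to a mixed workload is an Amdahl-style weighted sum over the parts of the program. The model below is an assumption made for illustration, not something the slides define: each class of work is given a share of the host runtime, and each share is scaled by the corresponding speedup or slowdown from the list above. The names `TIME_FACTOR` and `estimated_relative_time`, and the example shares, are hypothetical.

```python
# Relative-time multipliers derived from the rough figures above
# (2 socket 2.6 GHz SSE2 host as baseline). Values below 1 mean the
# work runs faster on Xeon Phi, values above 1 mean it runs slower.
TIME_FACTOR = {
    "parallel_vectorised": 1 / 2.5,
    "parallel_not_vectorised": 1 / 1.3,
    "vectorised_not_parallel": 1.5,
    "serial": 12.0,
    "bandwidth": 1 / 2.4,
}


def estimated_relative_time(fractions):
    """Estimate Xeon Phi runtime relative to the host runtime (host = 1.0).

    `fractions` maps a workload class to its share of host runtime;
    the shares must sum to 1.
    """
    total = sum(fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("runtime fractions must sum to 1")
    return sum(share * TIME_FACTOR[kind] for kind, share in fractions.items())


# Hypothetical profile: 90% parallel and vectorised, 2% serial,
# 8% bandwidth constrained.
example = {"parallel_vectorised": 0.90, "serial": 0.02, "bandwidth": 0.08}
relative = estimated_relative_time(example)
print(f"relative time on Xeon Phi: {relative:.3f} (speedup x{1 / relative:.2f})")
```

Under this model even a small serial share weighs heavily on the total because of the x12 slowdown. In the example the 2% serial part accounts for more than a third of the estimated Xeon Phi runtime.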








# LAB 1 – Activity 1 A Quick Smoke Test


5/26/2014



# LAB 1 – Activity 2 Measuring Vectorisation




# LAB 1 – Activity 3 Measuring Concurrency


## Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804



