

# **Implicit Vectorisation**

Stephen Blair-Chappell, Intel Compiler Labs

# This training relies on you owning a copy of the following...

#### Parallel Programming with Parallel Studio XE Stephen Blair-Chappell & Andrew Stokes

#### Wiley ISBN: 9780470891650

#### Part I: Introduction

- 1: Parallelism Today
- 2: An Overview of Parallel Studio XE
- 3: Parallel Studio XE for the Impatient




#### Part II: Using Parallel Studio XE

- 4: Producing Optimized Code
- 5: Writing Secure Code
- 6: Where to Parallelize
- 7: Implementing Parallelism
- 8: Checking for Errors
- 9: Tuning Parallelism
- 10: Advisor-Driven Design
- 11: Debugging Parallel Applications
- 12: Event-Based Analysis with VTune Amplifier XE

#### Part III: Case Studies

- 13: The World's First Sudoku 'Thirty-Niner'
- 14: Nine Tips to Parallel Heaven
- 15: Parallel Track-Fitting in the CERN Collider
- 16: Parallelizing Legacy Code



# What's in this section?

- (A seven-step optimization process)
- Using different compiler options to optimize your code
- Using auto-vectorization to tune your application to different CPUs



Optimization Notice

## **The Sample Application**

- Initialises two matrices with a numeric sequence
- Does a Matrix Multiplication

| Intel(R) Composer XE 2011 Intel(R) 64                                                                                                                                            | Visual Studio 2008                                                               |                                                                                                                                                    |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|
| Time Elapsed 0.502719 Secs<br>Time Elapsed 0.479788 Secs<br>Time Elapsed 0.544186 Secs<br>Time Elapsed 0.495235 Secs<br>Time Elapsed 0.491859 Secs<br>Time Elapsed 0.483297 Secs | Total=6798.680541<br>Total=6798.680541<br>Total=6798.680541<br>Total=6798.680541 | Check Sum = 160160000<br>Check Sum = 160160000 |



# The main loop (without timing & printf)

```
// repeat experiment six times
for( l=0; l<6; l++ )
{
    // initialize matrix a
    sum = Work(&total,a);
    // initialize matrix b;
    for (i = 0; i < N; i++) {
      for (j=0; j<N; j++) {
        for (k=0;k<DENOM_LOOP;k++) {
          sum += m/denominator;
        }
        b[N*i + j] = sum;
      }
    }
    // do the matrix manipulation
    MatrixMul( (double (*)[N])a, (double (*)[N])b, (double (*)[N])c);
}
```



# The Matrix Multiply

```
void MatrixMul(double a[N][N], double b[N][N], double c[N][N])
{
  int i,j,k;
  for (i=0; i<N; i++) {
    for (j=0; j<N; j++) {
      for (k=0; k<N; k++) {
        // accumulate the dot product of row i of a and column j of b
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }
}
```
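The cast `(double (*)[N])a` in the main loop lets a flat, heap-allocated buffer be passed to a function that expects a two-dimensional array. The following is a minimal sketch of that calling pattern, not the sample application itself: the size `N = 4`, the fill values, and the helper name `demo_matrix_mul` are assumptions made for illustration.

```c
#include <stdlib.h>

#define N 4  /* assumed small size for illustration; the slides do not state N */

/* Same triple loop as the slide: c += a * b for N x N matrices. */
void MatrixMul(double a[N][N], double b[N][N], double c[N][N])
{
  int i, j, k;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      for (k = 0; k < N; k++) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }
}

/* Hypothetical helper: multiplies a matrix of 1.0s by a matrix of 2.0s
   and returns c[0][0], or -1.0 if an allocation fails. */
double demo_matrix_mul(void)
{
  double *a = malloc(N * N * sizeof *a);
  double *b = malloc(N * N * sizeof *b);
  /* c must start at zero because MatrixMul accumulates into it */
  double *c = calloc(N * N, sizeof *c);
  double result = -1.0;

  if (a && b && c) {
    for (int i = 0; i < N * N; i++) {
      a[i] = 1.0;
      b[i] = 2.0;
    }
    /* reinterpret the flat buffers as N x N arrays, as in the main loop */
    MatrixMul((double (*)[N])a, (double (*)[N])b, (double (*)[N])c);
    result = c[0];
  }
  free(a);
  free(b);
  free(c);
  return result;
}
```

Because `MatrixMul` uses `+=`, the result matrix has to be zeroed before each call, which is why the sketch allocates `c` with `calloc`.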




| Step   | Action                                             | Example option (Windows)                        | (Linux)                  |
|--------|----------------------------------------------------|-------------------------------------------------|--------------------------|
| Step 1 | Build with optimization disabled                   | /Od                                             | (-O0)                    |
| Step 2 | Use General Optimizations                          | /O1, /O2, /O3                                   | (-O1, -O2, -O3)          |
| Step 3 | Use Processor-Specific Options                     | /QxSSE4.2<br>/QxHOST                            | (-xsse4.2)<br>(-xhost)   |
| Step 4 | Add Inter-procedural Optimization                  | /Qipo                                           | (-ipo)                   |
| Step 5 | Use Profile Guided Optimization                    | /Qprof-gen<br>/Qprof-use                        | (-prof-gen)<br>(-prof-use) |
| Step 6 | Tune automatic vectorization                       | /Qguide                                         | (-guide)                 |
| Step 7 | Implement Parallelism or use Automatic Parallelism | Use Intel Family of Parallel Models<br>/Qparallel | (-parallel)            |








## Intel® Compiler Architecture





# Getting Visibility: Compiler Optimization Report

Compiler switch: -opt-report-phase[=phase] (Linux); 'phase' can be:

- ipo Interprocedural Optimization
- ilo Intermediate Language Scalar Optimization
- hpo High Performance Optimization
- hlo High-level Optimization
- all All optimizations (not recommended, output too verbose)

Control the level of detail in the report:

- /Qopt-report[0|1|2|3] (Windows)
- -opt-report[0|1|2|3] (Linux, Mac OS X)



Step 2


# **Optimization Report Example**

icc -O3 -opt-report-phase=hlo -opt-report-phase=hpo

icl /O3 /Qopt-report-phase:hlo /Qopt-report-phase:hpo

```
...
LOOP INTERCHANGE in loops at line: 7 8 9
Loopnest permutation ( 1 2 3 ) --> ( 2 3 1 )
...
Loop at line 8 blocked by 128
Loop at line 9 blocked by 128
Loop at line 10 blocked by 128
...
Loop at line 10 unrolled and jammed by 4
Loop at line 8 unrolled and jammed by 4
...
...(10)... loop was not vectorized: not inner loop.
...(8)... loop was not vectorized: not inner loop.
...(9)... PERMUTED LOOP WAS VECTORIZED
...
```

icc -vec-report2 (icl /Qvec-report2) for just the vectorization report
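The report shows the compiler interchanging the loops of the matrix multiply so that a different loop ends up innermost and can be vectorized. As a rough illustration of what such an interchange means, the sketch below rewrites the i, j, k nest by hand as i, k, j, so that the innermost loop walks `b[k][j]` and `c[i][j]` with unit stride. The exact order the compiler chose, the size `N = 8`, and the function names are assumptions; the sketch only demonstrates that the interchange leaves the result unchanged.

```c
#define N 8  /* assumed size for illustration */

/* Original order: i, j, k. The inner k loop reads b[k][j] down a column,
   so consecutive iterations are N doubles apart in memory. */
void matmul_ijk(double a[N][N], double b[N][N], double c[N][N])
{
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        c[i][j] += a[i][k] * b[k][j];
}

/* Hand-interchanged order: i, k, j. The inner j loop now reads b[k][j]
   and writes c[i][j] along a row, which is unit stride. */
void matmul_ikj(double a[N][N], double b[N][N], double c[N][N])
{
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++) {
      double aik = a[i][k];  /* invariant in the inner loop */
      for (int j = 0; j < N; j++)
        c[i][j] += aik * b[k][j];
    }
}

/* Hypothetical check: returns 1 if both orders give identical results. */
int interchange_matches(void)
{
  static double a[N][N], b[N][N], c1[N][N], c2[N][N];

  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      a[i][j] = i + j;
      b[i][j] = i - j;
      c1[i][j] = 0.0;
      c2[i][j] = 0.0;
    }

  matmul_ijk(a, b, c1);
  matmul_ikj(a, b, c2);

  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      if (c1[i][j] != c2[i][j])
        return 0;
  return 1;
}
```

Each `c[i][j]` still accumulates its products in ascending `k` order in both versions, so the two results agree exactly.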




#### **There are lots of Phases!**

#### icl /Qopt-report-help

Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.0.3.175 Build 20110309

Copyright (C) 1985-2011 Intel Corporation. All rights reserved.

Intel(R) Compiler Optimization Report Phases

usage: -Qopt-report-phase \<phase\>

ipo, ipo\_inl, ipo\_cp, ipo\_align, ipo\_modref, ipo\_lpt, ipo\_subst, ipo\_ratt, ipo\_vaddr, ipo\_pdce, ipo\_dp, ipo\_gprel, ipo\_pmerge, ipo\_dstat, ipo\_fps, ipo\_ppi, ipo\_unref, ipo\_wp, ipo\_dl, ipo\_psplit,

ilo, ilo\_arg\_prefetching, ilo\_lowering, ilo\_strength\_reduction, ilo\_reassociation, ilo\_copy\_propagation, ilo\_convert\_insertion, ilo\_convert\_removal, ilo\_tail\_recursion,

hlo, hlo\_fusion, hlo\_distribution, hlo\_scalar\_replacement, hlo\_unroll, hlo\_prefetch, hlo\_loadpair, hlo\_linear\_trans, hlo\_opt\_pred, hlo\_data\_trans, hlo\_string\_shift\_replace, hlo\_ftae, hlo\_reroll, hlo\_array\_contraction, hlo\_scalar\_expansion, hlo\_gen\_matmul, hlo\_loop\_collapsing,

hpo, hpo\_analysis, hpo\_openmp, hpo\_threadization, hpo\_vectorization,

pgo, tcollect, offload, all




# **Getting Visibility: Assembler Listing**

| Command / output | Purpose |
|------------------|---------|
| make chapter4.o CFLAGS= | Generate assembler file |
| -> wtime(EXTERN)<br>-> printf(EXTERN)<br>-> printf(EXTERN)<br>-> malloc(EXTERN)<br>-> printf(EXTERN)<br>-> malloc(EXTERN)<br>-> printf(EXTERN)<br>-> malloc(EXTERN) | Create a report |

| Assembler Code | Source Code |
|----------------|-------------|
| B1.44: # Preds B1.3 # Infreq<br>movl $.L_2STRING5.0, %edi #36.11<br>xorl %eax, %eax #36.11<br>tag_value_main.41: #36.11<br>call printf #36.11<br>tag_value_main.42: #<br>jmp B1.41 # Prob 100% #36.11 | |
| B1.46: # Preds B1.50 # Infreq<br>xorl %esi, %esi #32.19<br>movl $10, %edx #32.19<br>movq 8(%r15), %rdi #32.19<br>call **strtol** #32.19 | if(argc == 2)<br>denominator = **atoi**(argv[1]); |
| B1.47: # Preds B1.46 # Infreq | |





# Step 3

# Using Processor Specific Options







## **SIMD Instruction Enhancements**









## **SIMD Types in Processors from Intel [1]**



#### MMX™

- Vector size: 64 bit
- Data types: 8, 16 and 32 bit integers
- VL: 2, 4, 8
- Sample: Xi, Yi 16 bit integers



#### Intel® SSE

- Vector size: 128 bit
- Data types: 8, 16, 32, 64 bit integers; 32 and 64 bit floats
- VL: 2, 4, 8, 16
- Sample: Xi, Yi 32 bit int / float
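The VL (vector length) values listed for MMX and SSE follow from dividing the vector size by the element width. A small sketch of that arithmetic, using a hypothetical helper name:

```c
/* Number of elements (VL) that fit in one vector register.
   vector_bits:  register width in bits, e.g. 64 for MMX, 128 for SSE
   element_bits: width of one element in bits, e.g. 16 for a 16 bit integer */
int vector_length(int vector_bits, int element_bits)
{
  return vector_bits / element_bits;
}
```

For example, `vector_length(64, 16)` is 4 for the MMX sample with 16 bit integers, and `vector_length(128, 32)` is 4 for the SSE sample with 32 bit ints or floats.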



## **SIMD Types in Processors from Intel [2]**



#### Intel® AVX

- Vector size: 256 bit
- Data types: 32 and 64 bit floats
- VL: 4, 8, 16
- Sample: Xi, Yi 32 bit int or float



#### Intel® MIC

- Vector size: 512 bit
- Data types: 32 and 64 bit integers; 32 and 64 bit floats (some support for 16 bit floats)
- VL: 8, 16
- Sample: 32 bit float



# Hands-on Lab






#### Key Intel<sup>®</sup> Advanced Vector Extensions (Intel<sup>®</sup> AVX) Features

| Key Features | Benefits |
|--------------|----------|
| Wider Vectors<br>Increased from 128 to 256 bit<br>Two 128-bit load ports | Up to 2x peak FLOPs (floating point operations per second) output with good power efficiency |
| Enhanced Data Rearrangement<br>Use the new 256 bit primitives to broadcast, mask loads and permute data | Organize, access and pull only necessary data more quickly and efficiently |
| Three and four Operands: Non Destructive Syntax for both AVX 128 and AVX 256 | Fewer register copies, better register use for both vector and scalar code |
| Flexible unaligned memory access support | More opportunities to fuse load and compute operations |
| Extensible new opcode (VEX) | Code size reduction |

Intel<sup>®</sup> AVX is a general purpose architecture, expected to supplant SSE in all applications used today



## A New 3- and 4- Operand Instruction Format

 Intel<sup>®</sup> Advanced Vector Extensions (Intel<sup>®</sup> AVX) has a distinct destination argument that results in fewer register copies, better register use, more load/op macro-fusion opportunities, and smaller code size






#### Intel<sup>®</sup> Microarchitecture (Sandy Bridge) Highlights



# **Two Key Decisions to be Made:**

1. How do we **introduce** the vector code?

2. How do we deal with the **multiple** SIMD instruction set **extensions** like SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX ...?











# **Overview of Writing Vector Code**

#### **Array Notation**

A[:] = B[:] + C[:];

#### **Elemental Function**

\_\_declspec(vector)
float ef(float a, float b) {
 return a + b;
}
A[:] = ef(B[:], C[:]);

#### **SIMD Directive**

#pragma simd
for (int i = 0; i < N; ++i) {
 A[i] = B[i] + C[i];
}

#### **Auto-Vectorization**

for (int i = 0; i < N; ++i) {
 A[i] = B[i] + C[i];
}
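Of the four approaches above, only auto-vectorization needs no language extension, so it can be shown in plain C. The sketch below wraps the same loop in a function; the function name and the use of the C99 `restrict` qualifier are assumptions added here. `restrict` tells the compiler the arrays do not overlap, which is one common way to make it easier for a vectorizer to prove the iterations independent.

```c
#include <stddef.h>

/* Element-wise add written as ordinary scalar code. The restrict
   qualifiers promise the compiler that a, b and c do not alias. */
void add_arrays(float *restrict a, const float *restrict b,
                const float *restrict c, size_t n)
{
  for (size_t i = 0; i < n; ++i) {
    a[i] = b[i] + c[i];
  }
}
```

Whether a particular compiler actually vectorizes this loop depends on the options used, which is what the vectorization report discussed later is for.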










## **Auto-Vectorization**

Transforming sequential code to exploit the vector (SIMD, SSE) processing capabilities






Copyright© 2012, Intel Corporation. All rights reserved. \*Other brands and names are the property of their respective owners.

## How do I know if a loop is vectorised?

-vec-report

> icl /Qvec-report MultArray.c
MultArray.c(92): (col. 5) remark: LOOP WAS VECTORIZED.

/Qvec-report1 (default), /Qvec-report2, /Qvec-report3, /Qvec-report4, /Qvec-report5, /Qvec-report6





#### Diagnostic Level of Vectorization Switch

Linux and Mac OS X: -vec-report\<N\>; Windows: /Qvec-report\<N\>

| N | Diagnostic Messages                                                       |
|---|---------------------------------------------------------------------------|
| 0 | No diagnostic messages; same as not using switch and thus default         |
| 1 | Report about vectorized loops; default if switch is used but N is missing |
| 2 | Report about vectorized loops and non-vectorized loops                    |
| 3 | Same as N=2 but add information on assumed and proven dependencies        |
| 4 | Report about non-vectorized loops                                         |
| 5 | Same as N=4 but add detail on why vectorization failed                    |

#### Note:

 In case inter-procedural optimization (-ipo or /Qipo) is activated and compilation and linking are separate compiler invocations, the switch needs to be added to the link step
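The higher report levels mention dependencies because a loop can only be vectorized when its iterations can safely be executed together. The sketch below contrasts two loops under that criterion; the function names are hypothetical, and no particular report text is implied for either loop.

```c
#include <stddef.h>

/* Independent iterations: each out[i] depends only on in[i],
   so several iterations can be computed at once. */
void scale(double *restrict out, const double *restrict in,
           double factor, size_t n)
{
  for (size_t i = 0; i < n; ++i) {
    out[i] = in[i] * factor;
  }
}

/* Loop-carried dependence: a[i] needs the a[i-1] produced by the
   previous iteration, so the iterations cannot simply run side by side. */
void running_sum(double *a, const double *b, size_t n)
{
  for (size_t i = 1; i < n; ++i) {
    a[i] = a[i - 1] + b[i];
  }
}
```

Compiling a file like this with one of the report levels that covers non-vectorized loops is a quick way to see how the compiler describes each case.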





## How do I know if a loop is vectorised?

- -vec-report7
  - Experimental Feature
  - See http://software.intel.com/en-us/articles/vecanalysis-python-script-for-annotating-intelr-compiler-vectorization-report
  - Requires two python scripts

icc -c -vec-report7 satSub.c 2>&1 | ./vecanalysis/vecanalysis.py -list

| Message                                   | Count | %     |
|-------------------------------------------|-------|-------|
| scalar loop cost: 3.                      | 115   | 90.6% |
| loop was not vectorized: 1.               | 106   | 83.5% |
| unmasked unaligned unit stride stores: 2. | 97    | 76.4% |
| heavy-overhead vector operations: 4.      | 84    | 66.1% |
| unmasked unaligned unit stride loads: 2.  | 79    | 62.2% |
| lightweight vector operations: 2.         | 74    | 58.3% |
| estimated potential speedup: 0.690000.    | 71    | 55.9% |
| vector loop cost: 4.250000.               | 71    |       |





#### How do I know if a loop is vectorised?

328: TMP1=KGLN

329: TMP2=KST

- VECRPT (col. 1) LOOP WAS VECTORIZED.
- VECRPT (col. 1) estimated potential speedup: 2.860000.
- VECRPT (col. 1) lightweight vector operations: 17.
- VECRPT (col. 1) loop inside vectorized loop at nesting level: 1.
- VECRPT (col. 1) loop was vectorized (with peel/with remainder)
- VECRPT (col. 1) medium-overhead vector operations: 4.
- VECRPT (col. 1) remainder loop was not vectorized: 1.
- VECRPT (col. 1) scalar loop cost: 7.
- VECRPT (col. 1) unmasked aligned unit stride stores: 2.
- VECRPT (col. 1) unmasked unaligned unit stride loads: 3.
- VECRPT (col. 1) unmasked unaligned unit stride stores: 1.
- VECRPT (col. 1) vector loop cost: 2.250000.

330: DO JLAT=TMP1,KGLX

331: !DO JLAT=KGLN,KGLX

332: IADDR(JLAT)=KSTABUF(JLAT)+YDSL%NASLB1\*(0-KFLDN)+YDSL%NASLB1\*(1-KSLEV)

333: ENDDO

334:





# **Scalar and Packed Instructions**






## **Examples of Code Generation**

| Source code | SSE2 |
|-------------|------|
| static double A[1000], B[1000], | .B1.2::<br>movaps xmm2, A[rdx\*8]<br>xorps xmm0, xmm0<br>cmpltpd xmm0, xmm2<br>movaps xmm1, B[rdx\*8]<br>andps xmm1, xmm0<br>andnps xmm0, C[rdx\*8]<br>orps xmm1, xmm0<br>addpd xmm2, xmm1<br>movaps A[rdx\*8], xmm2<br>add rdx, 2<br>cmp rdx, 1000<br>jl .B1.2 |
| **AVX**<br>.B1.2::<br>vmovaps ymm3, A[rdx\*8]<br>vmovaps ymm1, C[rdx\*8]<br>vcmpgtpd ymm2, ymm3, ymm0<br>vblendvpd ymm4, ymm1, B[rdx\*8], ymm2<br>vaddpd ymm5, ymm3, ymm4<br>vmovaps A[rdx\*8], ymm5<br>add rdx, 4<br>cmp rdx, 1000<br>jl .B1.2 | **SSE4.1**<br>.B1.2::<br>movaps xmm2, A[rdx\*8]<br>xorps xmm0, xmm0<br>cmpltpd xmm0, xmm2<br>movaps xmm1, C[rdx\*8]<br>blendvpd xmm1, B[rdx\*8], xmm0<br>addpd xmm2, xmm1<br>movaps A[rdx\*8], xmm2<br>add rdx, 2<br>cmp rdx, 1000<br>jl .B1.2 |
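The source code on this slide is cut off after its first line, so the loop that produced these listings is not visible. Reading the assembly (compare `A` against zero, select `B` or `C` per element, add the selection to `A`), a loop of roughly the following shape would be consistent with it. This is a reconstruction from the instructions, not the slide's code; the third array `C`, the function name, and the loop body are all assumptions.

```c
#define LEN 1000

/* A and B appear in the visible source line; C is inferred from the
   C[rdx*8] operands in the assembly. */
static double A[LEN], B[LEN], C[LEN];

/* Assumed loop: add B[i] where A[i] is positive, otherwise add C[i].
   The compare becomes a mask and the if/else becomes a blend
   (and/andn/or on SSE2, blendvpd on SSE4.1, vblendvpd on AVX). */
void conditional_add(void)
{
  for (int i = 0; i < LEN; i++) {
    if (A[i] > 0)
      A[i] += B[i];
    else
      A[i] += C[i];
  }
}
```

The listings differ mainly in how the selection is done and in the stride: the SSE variants advance `rdx` by 2 doubles per iteration, the AVX variant by 4.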



Step 3

## Out-of-the-box behaviour – Intel Compiler

Auto-vectorisation is enabled by default (turn it off with -no-vec or /Qvec-).

The option -msse2 or /arch:SSE2 is used by default (as long as no -x, -ax or -m option has been used).

-msse2: "May generate Intel® SSE2 and SSE instructions ... This value is only available on Linux systems".



# Building for non-Intel CPUs /arch: (-m)

| Option | Description                                                                |
|--------|----------------------------------------------------------------------------|
| mic    | MIC (Linux only at the moment)                                             |
| avx    | AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE.                           |
| sse4.2 | SSE4.2 SSE4.1, SSSE3, SSE3, SSE2, and SSE.                                 |
| sse4.1 | SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions.                           |
| ssse3  | SSSE3, SSE3, SSE2, and SSE instructions.                                   |
| sse2   | May generate Intel® SSE2 and SSE instructions.                             |
| sse    | This option has been deprecated; it is now the same as specifying ia32.    |
| ia32   | Generates x86/x87 generic code that is compatible with IA-32 architecture. |

This option tells the compiler to generate code specialized for the processor that executes your program.

Code generated with these options should execute on any compatible, non-Intel processor with support for the corresponding instruction set.
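
A program can report which instruction set it was built for by checking predefined macros. The sketch below assumes the `__SSE2__` ... `__AVX2__` macros that GCC and Clang define from the -m options (the Intel compiler is expected to define them too, but check your version); the function name is hypothetical.

```c
/* Returns the widest SIMD instruction set this translation unit was
   compiled for, judged from the compiler's predefined macros. */
const char *compiled_simd_target(void)
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_2__)
    return "SSE4.2";
#elif defined(__SSE4_1__)
    return "SSE4.1";
#elif defined(__SSSE3__)
    return "SSSE3";
#elif defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "generic";
#endif
}
```

Rebuilding the same file with a different -m (or /arch:) value changes the string that is returned.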







## Building for Intel processors /Qx (-x)

| Option | Description |
|--------|-------------|
| CORE-AVX2 | AVX2, AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions. |
| CORE-AVX-I | RDRND instruction, AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions. |
| AVX | AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions. |
| SSE4.2 | SSE4 Efficient Accelerated String and Text Processing instructions supported by Intel® Core<sup>™</sup> i7 processors. SSE4.1, SSSE3, SSE3, SSE2, and SSE. May optimize for the Intel® Core<sup>™</sup> processor family. |
| SSE4.1 | SSE4 Vectorizing Compiler and Media Accelerator, SSSE3, SSE3, SSE2, and SSE. May optimize for Intel® 45nm Hi-k next generation Intel® Core™ microarchitecture. |
| SSSE3_ATOM<br>(SSE3_ATOM is deprecated) | MOVBE (depending on -minstruction), SSSE3, SSE3, SSE2, and SSE.<br>Optimizes for the Intel® Atom<sup>™</sup> processor and Intel® Centrino® Atom<sup>™</sup> Processor Technology. |
| SSSE3 | SSSE3, SSE3, SSE2, and SSE. Optimizes for the Intel® Core<sup>™</sup> microarchitecture. |
| SSE3 | SSE3, SSE2, and SSE. Optimizes for the enhanced Pentium® M processor microarchitecture and Intel NetBurst® microarchitecture. |
| SSE2 | SSE2 and SSE. Optimizes for the Intel NetBurst® microarchitecture. |






## **Results of Enhancing Auto-Vectorisation**

| Setting | Time  | Speedup |
|---------|-------|---------|
| SSE2    | 0.293 | 1       |
| AVX     | 0.270 | 1.09    |







# Hands-on Lab






8/2/2012

**Vectorization Report** 

### "Loop was not vectorized" because:

- "Existence of vector dependence"
- "Non-unit stride used"
- "Mixed Data Types"
- "Condition too Complex"
- "Condition may protect exception"
- "Low trip count"
- "Subscript too complex"
- "Unsupported Loop Structure"
- "Contains unvectorizable statement at line XX" (e.g. function calls)
- "Not Inner Loop"
- "vectorization possible but seems inefficient"
- "Operator unsuited for vectorization"
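
To make a few of these messages concrete, here is a minimal sketch of loops that are likely to trigger them. The function names are made up, and which message a given compiler version actually prints is an assumption.

```c
#include <stddef.h>

/* Independent iterations, unit stride: a good vectorisation candidate. */
void add_arrays(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Loop-carried dependence: iteration i needs the value written by
   iteration i-1 ("existence of vector dependence"). */
void prefix_sum(double *a, const double *b, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}

/* Non-unit stride: walking down one column of a row-major matrix
   touches every cols-th element. */
double column_sum(const double *m, size_t rows, size_t cols, size_t col)
{
    double sum = 0.0;
    for (size_t r = 0; r < rows; r++)
        sum += m[r * cols + col];
    return sum;
}
```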

### Ways you can help the autovectoriser

- Change data layout to avoid non-unit strides
- Use #pragma ivdep
- Use the restrict keyword (C/C++)
- Use #pragma vector always
- Use #pragma simd
- Use elemental functions
- Use array notation
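
Two of these hints in a minimal sketch: `restrict` on the pointer parameters, and `#pragma ivdep` on a loop whose dependence the compiler cannot rule out. The function names are hypothetical; the `#pragma GCC ivdep` branch is an addition so that the file also builds with GCC, and is not part of the Intel toolchain.

```c
#include <stddef.h>

/* restrict (C99) promises that dst, x and y never overlap, so the
   compiler is free to reorder and vectorise the loads and stores. */
void saxpy_restrict(float *restrict dst, const float *restrict x,
                    const float *restrict y, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a * x[i] + y[i];
}

/* The compiler cannot know the sign of k, so it has to assume a
   dependence between a[i] and a[i + k]. The pragma tells it to ignore
   that assumed dependence. The caller must guarantee k >= 0. */
void scale_shifted(int *a, int k, int c, int m)
{
#if defined(__INTEL_COMPILER)
#pragma ivdep
#elif defined(__GNUC__) && !defined(__clang__)
#pragma GCC ivdep
#endif
    for (int i = 0; i < m; i++)
        a[i] = a[i + k] * c;
}
```

Both hints are promises made by the programmer. If the promise is false (overlapping arrays, negative k), the vectorised loop may silently give different results.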





Step 3

### **Consistency of SIMD results**

Two issues can affect reproducibility:

- Alignment
- Parallelism

Reason: the order in which the calculations are performed can change, and floating-point arithmetic is not associative.
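
A small sketch of why the order matters. The first function adds the elements left to right, as scalar code would; the second keeps two partial sums, which is the order a 2-lane vector reduction would use. Function names and test values are made up, and the results quoted assume plain single-precision arithmetic (a typical x86-64 build without fast-math options).

```c
#include <stddef.h>

/* Scalar order: one running total, elements added left to right. */
float sum_sequential(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Order of a 2-lane vector reduction: even and odd elements are
   accumulated separately and combined at the end.
   n is assumed to be even to keep the sketch short. */
float sum_two_lanes(const float *v, size_t n)
{
    float lane0 = 0.0f, lane1 = 0.0f;
    for (size_t i = 0; i + 1 < n; i += 2) {
        lane0 += v[i];
        lane1 += v[i + 1];
    }
    return lane0 + lane1;
}
```

For the input `{1e8f, 1.0f, -1e8f, 1.0f}` the sequential sum loses the first 1.0f when it is added to 1e8f and returns 1, while the two-lane sum returns 2. A change in alignment can move elements between the peel loop and the vector lanes, which changes the grouping in the same way.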





# **Alignment of Data**

**SSE2**: works better with 16-byte alignment.

Why? The XMM registers are 16 bytes (i.e. 128 bits) wide.

Penalties:

Unaligned access vs aligned access (but still in the same cache line): 40% worse.

Unaligned access vs aligned access (but split over a cache line): 500% worse.

Rule of thumb: try to align to the SIMD register size. MMX: 8 bytes; SSE2: 16 bytes; AVX: 32 bytes.

Also: try to align blocks of data to the cache-line size, i.e. 64 bytes.

Source: http://software.intel.com/en-us/articles/reducing-the-impact-of-misaligned-memory-accesses/
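
The two penalty cases differ only in whether the access crosses a 64-byte boundary. A minimal sketch of the address arithmetic, with made-up helper names:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* cache-line size quoted above */

/* True if addr is a multiple of a power-of-two alignment. */
int is_aligned(uintptr_t addr, size_t alignment)
{
    return (addr & (alignment - 1)) == 0;
}

/* True if an access of `width` bytes starting at addr spans two
   cache lines. */
int splits_cache_line(uintptr_t addr, size_t width)
{
    return (addr % CACHE_LINE_BYTES) + width > CACHE_LINE_BYTES;
}
```

A 16-byte load at offset 40 is unaligned but stays inside one cache line; the same load at offset 56 is split over two lines.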





# **Compiler Intrinsics for Alignment**

### \_\_declspec(align(base, [offset]))

Instructs the compiler to create the variable so that it is aligned on a "base"-byte boundary, with an "offset" (default 0) in bytes from that boundary.

### void\* \_mm\_malloc (int size, int n)

Allocates memory and returns a pointer that is aligned on an n-byte boundary.

### #pragma vector aligned | unaligned

Use aligned or unaligned loads and stores for vector accesses.

### \_\_assume\_aligned(a, n)

Instructs the compiler to assume that array a is aligned on an n-byte boundary.
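
A portable sketch of the same ideas, assuming a C11 compiler. `_Alignas` stands in for `__declspec(align(...))` and `aligned_alloc` stands in for `_mm_malloc`; these are substitutes chosen so the example builds outside the Intel toolchain. The `__assume_aligned` hint is only compiled when the Intel compiler is detected. Function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SIMD_ALIGN 32  /* AVX register width in bytes */

/* Statically aligned array (C11 counterpart of __declspec(align(32))). */
static _Alignas(SIMD_ALIGN) double table[256];

/* Heap allocation aligned to SIMD_ALIGN bytes. aligned_alloc requires
   the size to be a multiple of the alignment, so the byte count is
   rounded up. Release the memory with free(). */
double *alloc_aligned_doubles(size_t count)
{
    size_t bytes = count * sizeof(double);
    bytes = (bytes + SIMD_ALIGN - 1) / SIMD_ALIGN * SIMD_ALIGN;
    return aligned_alloc(SIMD_ALIGN, bytes);
}

/* Scales an array that the caller guarantees to be SIMD_ALIGN-aligned. */
void scale_aligned(double *a, size_t n, double factor)
{
    assert(((uintptr_t)a % SIMD_ALIGN) == 0);
#if defined(__INTEL_COMPILER)
    __assume_aligned(a, SIMD_ALIGN);  /* Intel-specific alignment hint */
#endif
    for (size_t i = 0; i < n; i++)
        a[i] *= factor;
}
```

Aligning the data is only half of the job: the hint (or `#pragma vector aligned`) is what tells the compiler it may use aligned loads and stores.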



"I've stopped using the Intel compiler. Each time I ship the product to a customer, they complain that the application crashes!"

A games developer at a recent networking event.





# Imagine this scenario:

- 1. Your IT dept have just bought you the latest and greatest Intel-based workstation.
- 2. You've heard **auto-vectorisation** can make a real difference to performance.
- 3. You enable auto-vectorisation using -xhost.
- 4. You boast to your colleagues, "my application runs faster than anything you can write..."
- 5. You send the application to a colleague, but it refuses to run.





# What might be the issue? How can it be overcome?





# **Running a Mismatched Application**

Intel(R) Composer XE 2011 Intel(R) 64 Visual Studio 2008

C:\du\chapter 4>main

Fatal Error: This program was not built to run on the processor in your system. The allowed processors are: Intel(R) processors with SSE4.2 and POPCNT instructions support.
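
One general way to avoid this failure is to ship a baseline code path and pick a faster one at run time, once the CPU has been checked. The sketch below is a hand-written illustration of that idea, not what the Intel compiler emits. It assumes the GCC/Clang builtin `__builtin_cpu_supports`; on other compilers or architectures it falls back to the baseline. All function names are hypothetical.

```c
#include <stddef.h>

typedef void (*scale_fn)(double *, size_t, double);

/* Baseline version: built for the lowest instruction set you ship for. */
static void scale_baseline(double *a, size_t n, double f)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= f;
}

/* Stand-in for a version that would live in a file compiled with AVX
   enabled; the body is identical here to keep the sketch portable. */
static void scale_avx_build(double *a, size_t n, double f)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= f;
}

/* Run-time CPU check; returns 0 when no check is available. */
static int cpu_has_avx(void)
{
#if defined(__GNUC__) && !defined(__INTEL_COMPILER) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") != 0;
#else
    return 0;
#endif
}

/* Chooses the implementation once; callers use the returned pointer. */
scale_fn select_scale(void)
{
    return cpu_has_avx() ? scale_avx_build : scale_baseline;
}
```

The program then runs on any processor that supports the baseline, and only uses the faster path where the check succeeds.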












Copyright© 2012, Intel Corporation. All rights reserved. \*Other brands and names are the property of their respective owners.

#### The vectorised code uses Packed Instructions

| icc -c -vec-report2 chapter4.c |
|--------------------------------|
| chapter4.c(56): (col. 9) remark: loop was not vectorized: vectorization po |
| chapter4.c(55): (col. 7) remark: loop was not vectorized: not inner loop. |
| chapter4.c(54): (col. 5) remark: loop was not vectorized: not inner loop. |
| chapter4.c(64): (col. 5) remark: PERMUTED LOOP WAS VECTORIZED. |
| chapter4.c(64): (col. 5) remark: loop was not vectorized: not inner loop. |
| chapter4.c(64): (col. 5) remark: loop was not vectorized: not inner loop. |
| chapter4.c(45): (col. 3) remark: loop was not vectorized: nonstandard loop |
| chapter4.c(11): (col. 7) remark: loop was not vectorized: existence of vec |
| chapter4.c(10): (col. 5) remark: loop was not vectorized: not inner loop. |
| chapter4.c(9): (col. 3) remark: loop was not vectorized: not inner loop. |

..B1.25:

| Instruction | Operands               | Source position |
|-------------|------------------------|-----------------|
| movsd       | (%r14,%r12,8), %xmm1   | #64.5 |
| movsd       | 16(%r14,%r12,8), %xmm2 | #64.5 |
| movsd       | 32(%r14,%r12,8), %xmm3 | #64.5 |
| movsd       | 48(%r14,%r12,8), %xmm4 | #64.5 |
| movhpd      | 8(%r14,%r12,8), %xmm1  | #64.5 |
| movhpd      | 24(%r14,%r12,8), %xmm2 | #64.5 |
| movhpd      | 40(%r14,%r12,8), %xmm3 | #64.5 |
| movhpd      | 56(%r14,%r12,8), %xmm4 | #64.5 |
| mulpd       | %xmm0, %xmm1           | #64.5 |
| mulpd       | %xmm0, %xmm2           | #64.5 |
| mulpd       | %xmm0, %xmm3           | #64.5 |
| mulpd       | %xmm0, %xmm4           | #64.5 |
| addpd       | (%rdi,%r12,8), %xmm1   | #64.5 |
| addpd       | 16(%rdi,%r12,8), %xmm2 | #64.5 |
| addpd       | 32(%rdi,%r12,8), %xmm3 | #64.5 |
| addpd       | 48(%rdi,%r12,8), %xmm4 | #64.5 |
| movaps      | %xmm1, (%rdi,%r12,8)   | #64.5 |
| movaps      | %xmm2, 16(%rdi,%r12,8) | #64.5 |
| movaps      | %xmm3, 32(%rdi,%r12,8) | #64.5 |
| movaps      | %xmm4, 48(%rdi,%r12,8) | #64.5 |
| addq        | \$8, %r12              | #64.5 |
| cmpq        | %r13, %r12             | #64.5 |
| jb          | ..B1.25 # Prob 99%     | #64.5 |
| jmp         | ..B1.30 # Prob 100%    | #64.5 |
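
The listing multiplies packed pairs of doubles by a broadcast value (mulpd) and adds them to the destination (addpd). The same pattern written by hand with SSE2 intrinsics looks roughly like this. It is an illustrative sketch: the function name is made up, and the correspondence to line 64 of chapter4.c is an assumption, since that source line is not shown here. Without SSE2 the code falls back to the scalar loop.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* c[i] += s * a[i], two doubles per step when SSE2 is available. */
void scaled_add(double *c, const double *a, double s, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    __m128d vs = _mm_set1_pd(s);                  /* s in both lanes */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);         /* packed load */
        __m128d vc = _mm_loadu_pd(&c[i]);
        vc = _mm_add_pd(vc, _mm_mul_pd(va, vs));  /* mulpd, then addpd */
        _mm_storeu_pd(&c[i], vc);                 /* packed store */
    }
#endif
    for (; i < n; i++)                            /* scalar remainder */
        c[i] += s * a[i];
}
```

The compiler's version is additionally unrolled four times, which is why the loop counter in the listing advances by 8 elements per iteration.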








# Hands-on Lab






### **Compiler Options that Help Vectorisation**

- -O3 (/O3): performs other loop transformations first
- -ipo (/Qipo): may inline, or get dependency, loop-count or alignment information from calling functions
- -xavx (/QxAVX) or -xhost (/QxHOST): use all available instructions
- -fno-alias (/Oa): assume pointers are not aliased (dangerous!)
- -fargument-noalias (/Qalias-args-): assume function arguments are not aliased
- -fansi-alias (/Qansi-alias): assume different data types are not aliased
- -guide (/Qguide): get advice on how to help the compiler to vectorize loops
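
Why the no-alias options are dangerous: if the pointers do overlap, a loop that loads a block before storing gives a different answer from the scalar loop. The second function below simulates that load-first behaviour in plain C so the effect can be seen without any special option. Names are hypothetical and `MAX_N` is an arbitrary limit for the sketch.

```c
#include <stddef.h>
#include <string.h>

#define MAX_N 64

/* Scalar semantics: each iteration reads src[i] after all earlier
   stores to dst have completed. */
void add_one_in_order(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1.0;
}

/* Simulates a vectorised loop that was told the pointers never alias:
   the whole block of src is loaded first, then stored to dst.
   n must not exceed MAX_N. */
void add_one_loads_first(double *dst, const double *src, size_t n)
{
    double block[MAX_N];
    memcpy(block, src, n * sizeof(double));
    for (size_t i = 0; i < n; i++)
        dst[i] = block[i] + 1.0;
}
```

Called with `dst = buf + 1` and `src = buf` on a zeroed five-element buffer, the first function produces 0, 1, 2, 3, 4 and the second produces 0, 1, 1, 1, 1.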





# **Review Sheet for Efficient Vectorization**

- Are you using vector-friendly options such as -ansi-alias and -align array64byte?
- Are all hot loops vectorized and maximizing use of unit-stride accesses?
- Align the data and tell the compiler
- Have you studied the -vec-report6 output for hot loops to ensure these?
- Are any peel loops or remainder loops generated for your key loops (have you added the loop\_count pragma)?
  - Make changes to ensure significant runtime is not being spent in such loops
- Are you able to pad your arrays and get improved performance with -opt-assume-safe-padding?
- Have you added "#pragma vector aligned nontemporal" for all loops with streaming-store accesses to maximize performance?
- Avoid branchy code inside loops to improve vector efficiency
  - Avoid duplicates between then and else, use \_\_builtin\_expect to provide a hint, and move loop-invariant loads and stores from under the branch to outside the loop
- Use hardware-supported operations only (the rest will be emulated)
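
A minimal sketch of the "avoid duplicates between then and else" advice. Both functions compute the same result; in the second, the shared multiply is done once and only a select remains inside the loop. Function names are made up.

```c
#include <stddef.h>

/* Branchy form: the multiply appears in both arms. */
void update_branchy(double *a, const double *b, double x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (b[i] > 0.0)
            a[i] = x * b[i] + 1.0;
        else
            a[i] = x * b[i] - 1.0;
    }
}

/* Common work hoisted out of the branch; the remaining select maps
   onto a blend or mask operation. */
void update_select(double *a, const double *b, double x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double t = x * b[i];
        double bias = (b[i] > 0.0) ? 1.0 : -1.0;
        a[i] = t + bias;
    }
}
```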





# **Review Sheet for Vectorization 2**

- Use Intel Cilk Plus extensions for efficient and predictable vectorization
  - #pragma simd and !DEC\$ SIMD
    - Counterpart of OMP for vectorization
  - Short-vector array notation for C/C++
    - Shifts burden to the user to express explicit vectorization
    - High-level and portable alternative to using intrinsics
  - Use elemental functions (C and Fortran) for loops with function calls
    - Can also be used to express outer-loop vectorization
- Study opportunities for outer-loop vectorization based on code access patterns
  - Use array-notations OR elemental-functions to express it
- Make memory accesses unit-strided in vector-loops as much as possible
  - Important for C and Fortran
- F90 array notation also can be used in short-vector form
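
The Cilk Plus syntax needs a compiler that supports it, so the sketch below uses the OpenMP counterparts instead: `#pragma omp simd` in place of `#pragma simd`, and `#pragma omp declare simd` in place of an elemental function. This is a substitution made for portability, not the syntax the slides describe. Without OpenMP SIMD support enabled the pragmas are ignored and the code runs as ordinary scalar C. Function names are hypothetical.

```c
#include <stddef.h>

/* Elemental-style function: written for one element, with a request
   that the compiler also generates a vector variant. */
#pragma omp declare simd
double clamp01(double v)
{
    return v < 0.0 ? 0.0 : (v > 1.0 ? 1.0 : v);
}

/* Explicit vectorisation request for a loop that contains a call. */
void clamp_all(double *a, size_t n)
{
#pragma omp simd
    for (size_t i = 0; i < n; i++)
        a[i] = clamp01(a[i]);
}
```

As with `#pragma simd`, the burden shifts to the programmer: the pragma asserts that vectorising the loop is safe.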





# Thank You

### Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804



