C/C++ Extensions for Array Notations Programming Model

This section provides the syntax and semantics for the C/C++ language extensions for array notations.

Array Section Notation

In your application code, introduce a section operator in the standard C/C++ language, as follows:

section_operator ::= [<lower bound> : <length> : <stride>]

where the <lower bound>, <length>, and <stride> are of integer types, representing a set of integer values as follows:
<lower bound>, <lower bound + <stride>>, …, <lower bound> + (<length> - 1) * <stride>

A section operator can occur in place of a subscript operator. The following example is an array section with len elements: a[lb], a[lb + str], a[lb + 2*str], …, a[lb + (len-1)*str].a[lb:len:str]

Length is chosen instead of upper bound because, in declarations, C/C++ deals with lengths of arrays and not upper bounds. Use of length also makes it easier to ensure that array sections match in size.

Successive section operators designate a sub-array of a multidimensional array object. When absent, the <stride> defaults to 1. If the <length> is less than 1, the array section is undefined. You can also use [:] as a short hand for a whole array dimension if the size of the dimension is known from the array declaration. If either <lower bound> or <length> must be specified, you must specify both.

Example

a[0:3][0:4] // refers to 12 elements in the two-dimensional array a, starting at row 0, column 0, and ending at row 2, column 3.

b[0:2:3] // refers to elements 0 and 3 of the one-dimensional array b

b[:] //refers to the entire array b

Array Declarations for Array Notations

For the array section notation to be useful, the compiler must know the shape and size of an array object. The table below summarizes the different ways of declaring arrays and pointers to array objects in C/C++.

Length Storage Class Declaration
Fixed Static static int a[16][128]
Auto

void foo(void) {

int a[16][128];

}

Parameter void bar(int a[16][128]);
Heap int (*p2d)[128];
Variable (C99) Auto

void foo(int m, int n) {

int a[m][n];

}

Parameter void bar(int m, int n, int a[m][n]);
Heap

void bar(int m, int n) {

int (*p2d)[n];

}

The variable length array (VLA) notation is a C99 [ISO/IEC 9899] extension. It is supported by the GNU GCC and Intel® compilers.

Note iconNote

You must use –std=c99 (Linux* and Mac OS* X) or /Qstd=c99 (Windows*) compiler option for the compiler to accept the C99 extensions.

If the base of an array section has incompletely specified dimensions (such as a pointer variable), you must explicitly specify the length of the array section. This is illustrated in the following example.

Example
typedef int (*p2d)[128];
p2d p = (p2d) malloc(sizeof(int)*rows*128);
 p[:][:] // error
 p[0:rows][:] // ok

Operator Maps

Most C/C++ operators are available for array sections: +, -, *, /, %, <, ==, >, <=, !=, >=, ++, --, |, &, ^, &&, ||, !, -(unary), +(unary), +=, -=, *=, /=, *(pointer de-referencing). Operators are implicitly mapped to all elements of the array section operands. The operations on different elements can be executed in parallel with-out any ordering constraints.

Array operands in an operation must have the same rank and size. Rank is defined as the number of array section operators, and size is the length of each array section. A scalar operand is automatically filled to the whole array section of any rank.

Example
a[:] * b[:] // element-wise multiplication
a[3:2][3:2] + b[5:2][5:2] // matrix addition of the 2x2 matrices in a and b starting at a[3][3] and b[5][5]
a[0:4][1:2] + b[1:2][0:4] // error, different rank sizes
a[0:4][1:2] + b[0][1] // ok, adds a scalar b[0][1] to an array section.

Assignment Maps

The assignment operator applies in parallel to every element of the array section on the left hand side (LHS).

Example
a[:][:] = b[:][2][:] + c;
e[:] = d;
e[:] = b[:][1][:]; // error, different rank
a[:][:] = e[:]; // error, different rank

The right hand side (RHS) of an assignment is evaluated before any element on the left hand side is stored. The compiler will introduce necessary temporary arrays to ensure this semantics.

Example
a[1:s] = a[0:s] + 1; // use old value of a[1:s-1]

Because the RHS is executed before any assignment to the LHS, the compiler can vectorize the RHS computation even if some operand on the RHS may alias with the L-value on the LHS.

Gather and Scatter

When an array section occurs directly under a subscript expression, it designates a set of elements indexed by the values of the array section.

Example
unsigned index[10];
float out[10], in[10];
out[0:5] = in[index[0:5]]; // gather
out[index[5:5]] = in[0:5]; //scatter

If the index values in a scatter array section overlap with each other, the values for the duplicated locations must be the same, otherwise, the final stored value after the scatter is undefined.

Depending on the target architecture option chosen, and if the scatter/gather support is available on the target architecture, the compiler may map the array notations operations to the appropriate hardware.

Reductions

A reduction combines array section elements to generate a scalar result. Intel® Cilk™ Plus supports reductions on array sections. It defines a generic reduction function that applies a user-defined dyadic function. It also has nine built-in common reduction functions. The built-in functions are polymorphic functions that accept int, float, and other C basic data type arguments. The names and descriptions of reduction functions are summarized in the table below.

Reduction Function Prototypes
Function Prototypes Descriptions
__sec_reduce(fun, identity, a[:]) Generic reduction function. Reduces fun across the array a[:] using identity as the initial value.
__sec_reduce_add(a[:]) Built-in reduction function. Adds values passed as arrays
__sec_reduce_mul(a[:]) Built-in reduction function. Multiplies values passed as arrays
__sec_reduce_all_zero(a[:]) Built-in reduction function. Tests that array elements are all zero
__sec_reduce_all_nonzero(a[:]) Built-in reduction function. Tests that array elements are all non-zero
__sec_reduce_any_nonzero(a[:]) Built-in reduction function. Tests for any array element that is non-zero
__sec_reduce_min(a[:]) Built-in reduction function. Determines the minimum value of array elements
__sec_reduce_max(a[:]) Built-in reduction function. Determines the maximum value of array elements
__sec_reduce_min_ind(a[:]) Built-in reduction function. Determines the index of minimum value of array elements
__sec_reduce_max_ind(a[:]) Built-in reduction function. Determines the index of maximum value of array elements

The reduction operation can reduce on multiple ranks. The number of ranks reduced depends on the execution context. For a given execution context of rank m and a reduction array section argument with rank n, where n>m, the last n-m ranks of the array section argument are reduced.

Example
sum = __sec_reduce_add(a[:][:]); // sum across the whole array a
sum_of_column[:] = __sec_reduce_add(a[:][:]); // sum across the column of a

Function Maps

Maps are implicitly defined on scalar functions. All the array section arguments in a scalar function map call must have the same rank. Scalar arguments are automatically filled match any rank.

Example
a[:] = sin(b[:]);
a[:] = pow(b[:], c); // b[:]**c
a[:] = pow(c, b[:]); // c**b[:]
a[:] = foo(b[:]); // user defined function
a[:] = bar(b[:], c[:][:]); //error, different ranks

Mapped function calls are executed in parallel for all the array elements, with no specific ordering. Vector functions may have side effects. When there are conflicts during parallel execution, the program semantics may be different from the serial program.

Function maps are powerful tools used to apply a set of operations in parallel to all elements of an array section. The compiler takes advantage of function maps to generate multi-threaded parallel calls.

Many routines in the svml library are more highly optimized for Intel® microprocessors than for non- Intel microprocessors.

Passing Array Section Arguments

Intel® Cilk™ Plus supports a vector kernel style of programming, where vector code is encapsulated within a function by declaring array parameters of fixed or parameterized vector lengths.

The address of the first element of an array section can be passed as argument to an array parameter. The following example illustrates how to combine array notation to achieve vectorization inside a function body using Intel® Cilk™ Plus threading for parallel function calls.

Note iconNote

The parameter float x[m] depending on an earlier parameter int m is only supported in C99 and not in C++.

Example
#include <cilk/cilk.h>
void saxpy_vec(int m, float a, float x[m], float y[m]){
 y[:]+=a*x[:];
}
void main(void){
 int a[2048], b[2048] ;
 cilk_for (int i = 0; i < 2048; i +=256){
   saxpy_vec(256, 2.0, &a[i], &b[i]);
 }
}

By writing the function explicitly with array arguments, you can write portable vector codes using any threading runtime and scheduler.

Limitations

There are two limitations on the usage of array sections:

More often than not you can convert an if statement into a C/C++ conditional select operation in order to use array sections as shown in the following example.

Example
for (int i = 0; i < n; i++){
 if (a[i] > b[i]){
  c[i] = a[i] - b[i];
 } else{
  d[i] = b[i] - a[i];
 }
}
//can be rewritten as
c[0:n] = (a[0:n] > b[0:n]) ? a[0:n] - b[0:n] : c[0:n];
d[0:n] = (a[0:n] <= b[0:n]) ? b[0:n] - a[0:n] : d[0:n];

You can also pass a return array pointer argument to a function to achieve the same effect as a function return array value.

Programming Hints and Examples

There is no cost associated with writing an application that mixes scalar and data parallel operations on the same arrays. The Intel® compiler uses the array operations in the program to guide vectorization. The following example implements an FIR filter. The scalar code consists of a doubly nested loop where both the inner and outer loop can be vectorized. By writing the program in different ways using array notation, you can direct the compiler to vectorize differently.

Example: FIR Scalar Code
for (i=0; i<M-K; i++){
 s = 0
 for (j=0; j<K; j++){
  s+= x[i+j] * c[j];
 }
y[i] = s;
}
Example: FIR Inner Loop Vector
for (i=0; i<M-K; i++){
 y[i] = __sec_reduce_add(x[i:K] * c[0:K]);
}
Example: FIR Outer Loop Vector
y[0:M-K] = 0;
for (j=0; j<K; j++){
 y[0:M-K]+= x[j:M-K] * c[j];
}

The parallel array assignment semantics enables vectorization of computations on the right hand side (RHS) of the assignment even if the l-value of the assignment may alias with some operands on the RHS. The compiler introduces temporary arrays when aliasing occurs. Temporary arrays increase memory usage and incur extra overhead of writing and reading them. You can help eliminate unnecessary temporary copying by providing better aliasing information such as C99 restrict pointer attribute as shown in the following example:

Example: Using C99 restrict Pointer
void saxpy_vec(int m, float a, float (&x)[m], float(&y)[m]){
 y[:] += a * x[:]; // x may overlap with y,
// temporary array t[n] allocated by the compiler
// t[:] = y[:] + a * x[:]; y[:] = t[:]
}
 
void saxpy_vec(int m, float a, float restrict *x, float(&y)[m]){
 y[:] += a * x[0:m]; // x and y are disjointed, no temporary array
}

To take full advantage of vector instruction set, it is often beneficial to use a fixed vector length supported by the processor, as illustrated in the next example. The example code computes an average over nine points in a two-dimensional grid (avoiding the 1-element boundary around the grid where the nine-point average is not defined). The core computation is written in vector form as in a function nine_point_average. The application is further parallelized using a cilk_for loop at the call site. The grain size of the computation is controlled by the two compile time constants, VLEN and NROWS, which designate the length of vector operations and the number of rows processed in each invocation. Because the loads of adjacent rows are reused across multiple rows, you gain on memory locality by processing multiple rows inside the function. Using compile time constant for vector length makes the vector code generation more efficient and predictable.

Example
#include <malloc.h>
#include <cilk/cilk.h>
#define VLEN 4
#define NROWS 4
 
//-------------------------------------------------------------------
// Vector kernel
// for each grid
// o[x][y] = (i[x-1][y-1] + i[x-1][y]+ i[x-1][y+1] +
// i[x][y-1] + i[x][y] + i[x][y+1] +
// i[x+1][y-1] + i[x+1][y] + i[x+1][y+1])/9;
// written with: 
// 1) VLEN columns for vectorization
// 2) NROWS rows for the reuse of the adjacent row loads
//--------------------------------------------------------------------
 
void nine_point_average(int h, int w, int i, int j, float in[h][w], float out[h][w])
{
 float m[NROWS][VLEN]; 
 m[:][:] = in[i:NROWS][j:VLEN];
 m[:][:] += in[i+1:NROWS][j:VLEN];
 m[:][:] += in[i+2:NROWS][j:VLEN];
 m[:][:] += in[i:NROWS][j+1:VLEN];
 m[:][:] += in[i+1:NROWS][j+1:VLEN];
 m[:][:] += in[i+2:NROWS][j+1:VLEN];
 m[:][:] += in[i:NROWS][j+2:VLEN];
 m[:][:] += in[i+1:NROWS][j+2:VLEN];
 m[:][:] += in[i+2:NROWS][j+2:VLEN];
 out[i:NROWS][j:VLEN] = 0.1111f * m[:][:];
}
 
//---------------------------------------------------------------------
// caller
//---------------------------------------------------------------------
 
const int width = 512;
const int height = 512;
typedef float (*p2d)[];
 
int main() {
 p2d src = (p2d) malloc(width*height*sizeof(float));
 p2d dst = (p2d) malloc(width*height*sizeof(float));
 
// …
// perform average over 9 points
 cilk_for (int i = 0; i < height - NROWS - 3; i += NROWS) {
  for (int j = 0; j < width - VLEN - 3; j += VLEN) {
   nine_point_average(height, width, i, j, src, dst);
  }
 }
}
Command-line entry
icl -Qstd=c99 test.c

See Also


Submit feedback on this help topic

Copyright © 1996-2010, Intel Corporation. All rights reserved.