For an application where most of the computation is carried out in loops, Intel® compilers may be able to identify those loops and automatically generate multithreaded versions of them. This transformation applies to applications built for deployment on multicore processor platforms, including multicore systems with Hyper-Threading Technology (HT Technology) enabled.
Using the -parallel (Linux* and Mac OS* X) or the /Qparallel (Windows*) option enables parallelization for both Intel® microprocessors and non-Intel microprocessors. The resulting executable may perform better on Intel microprocessors than on non-Intel microprocessors. The parallelization can also be affected by certain options, such as /arch or /Qx (Windows) or -m or -x (Linux and Mac OS X).
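As a sketch of the option usage (the source file name is hypothetical; `icc` and `icl` are the Intel compiler drivers on the respective platforms):

```shell
# Linux* and Mac OS* X: enable auto-parallelization.
icc -parallel prog.c -o prog

# Windows* equivalent:
#   icl /Qparallel prog.c
```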
The compiler can analyze data flow in loops to determine which loops can be safely and efficiently executed in parallel. Automatic parallelization can sometimes result in shorter execution times. Compiler-enabled auto-parallelization can help reduce the time spent on several common tasks, such as searching for loops that are good work-sharing candidates, performing the data-flow analysis needed to verify correct parallel execution, and adding parallel compiler directives manually.
If -openmp and -parallel (Linux* and Mac OS* X) or /Qopenmp and /Qparallel (Windows*) compiler options are both specified on the same command line, the compiler only attempts to parallelize those functions that do not contain OpenMP* directives.
The following program contains a loop with a high iteration count:
Example:

```c
#include <math.h>
#include <stdlib.h>

void no_dep() {
    int a, n = 100000000;
    /* Heap allocation: a local array of 100 million floats
       would overflow the stack. */
    float *c = malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) {
        a = 2 * i - 1;
        c[i] = sqrtf(a);
    }
    free(c);
}
```
The compiler's data-flow analysis confirms that the loop does not contain cross-iteration data dependencies. The compiler generates code that divides the iterations as evenly as possible among the threads at runtime. The number of threads defaults to the number of processors, but you can set it explicitly using the OMP_NUM_THREADS environment variable. The parallel speedup for a given loop depends on factors such as the amount of work, the load balance among threads, and the overhead of thread creation and synchronization, but it is generally less than the number of threads. For a whole program, the speedup depends on the ratio of parallel to serial computation.
For builds with separate compiling and linking steps, be sure to link the OpenMP* runtime library when using automatic parallelization. The easiest way to do this is to use the Intel® compiler driver for linking.
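The separate compile-and-link flow might look like the following (a sketch; the file names are hypothetical, and `icc` is the Intel compiler driver on Linux*):

```shell
# Compile each translation unit with auto-parallelization enabled.
icc -c -parallel main.c
icc -c -parallel work.c

# Link with the compiler driver so the OpenMP* runtime library
# is added to the link line automatically.
icc -parallel main.o work.o -o app
```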
Copyright © 1996-2010, Intel Corporation. All rights reserved.