HPC - applying high-performance computing to imaging

Matrox Supersight e2 industrial computer
Matrox Supersight and MIL form a hardware platform and software environment that supports multi-core CPU, GPU and FPGA processing technologies in combination and replicated within the same form factor to provide processing scalability now and into the future.

(PresseBox) (München) Applications with image resolutions, data rates and analysis requirements that exceed the capabilities of a typical workstation computer continue to exist to this day. Moreover, developers must decide how to select and best use the processing technologies – multi-core CPU, GPU and FPGA – at their disposal. This article examines the suitability of a scalable heterogeneous computing platform for demanding applications by way of a representative scenario.

The right match
Finding the most suitable processing technology for a given algorithm, or the most suitable algorithm for a given processing technology, is a challenge for demanding applications. Suitability is defined as a function of performance – principally speed or throughput – provided that an algorithm can run at all on a given processing technology. The algorithms can be classified by their dependence on the incoming image data and the execution model that they support.

At one end of the scale are algorithms that depend minimally, or not at all, on image content. They consist of a short fixed sequence of operations with few or no alternate paths of execution (i.e., branches). Such algorithms generally lend themselves to data parallel execution where the same operation is performed on different data simultaneously. The algorithms work on actual pixels and predominantly employ rudimentary image processing techniques like arithmetic and neighborhood operations. Examples of these are flat-field correction and color filter array (CFA) demosaicing.
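Flat-field correction illustrates why such algorithms parallelize so well: every output pixel is computed from the same short formula and depends on no other pixel. The following is a minimal sketch in plain Python, assuming a simple offset-and-gain model; the function name, the tiny sample images and the 8-bit clamp are illustrative assumptions, not the FPGA or MIL implementation.

```python
def flat_field_correct(raw, dark, gain, max_value=255):
    """Apply (raw - dark) * gain to every pixel independently.

    Each pixel needs only its own raw, dark and gain values, so the
    loop body could run for all pixels at the same time.
    """
    corrected = []
    for raw_row, dark_row, gain_row in zip(raw, dark, gain):
        row = []
        for r, d, g in zip(raw_row, dark_row, gain_row):
            value = round((r - d) * g)
            # Clamp to the valid range (8-bit is an assumption here).
            row.append(min(max(value, 0), max_value))
        corrected.append(row)
    return corrected


# Hypothetical 2 x 2 example: per-pixel dark offsets and gain factors.
raw = [[100, 110], [90, 120]]
dark = [[10, 10], [10, 10]]
gain = [[1.0, 0.9], [1.125, 0.8]]
result = flat_field_correct(raw, dark, gain)
print(result)
```

Because the loop body contains no branch that depends on image content and no dependency between pixels, the same operation maps naturally onto FPGA pipelines or GPU threads.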

At the other end of the scale are algorithms that – directly or indirectly – depend heavily upon image content. They employ a lengthier variable sequence of operations and make significant use of branching and iterative trial (i.e., heuristics). The operations within these algorithms are often restricted to sequential execution and cannot be massively parallelized. These algorithms typically work on edge sets, blobs and a list of points from the target image.

Solving an application may involve using algorithms from either end of the scale. Therefore, the ideal computing platform should be flexible enough to optimally combine the processing technologies that will most efficiently execute the required algorithms. The platform must also be flexible enough to permit the arrangement of various computing elements into different workgroups or clusters as needed to satisfy application workload and throughput requirements. Consequently, large amounts of data must be able to flow quickly among and within the platform’s clusters.

Matrox Supersight e2 with the Matrox Imaging Library MIL
The Matrox Supersight e2 industrial computer is specifically designed for high-throughput and computationally demanding imaging applications. The platform’s unique high-bandwidth multi-slot PCIe 2.0 x16 switched fabric backplane (Figure 1) can be configured to accommodate up to four computing clusters. A computing cluster consists of a Matrox system host board (SHB) with two multi-core CPUs, and optionally one or more (slot count permitting) Matrox frame grabbers – with or without FPGA-based processing offload capability – and one or more (slot count permitting) third-party GPUs.

The clusters within a single Matrox Supersight e2 can work independently – the traffic within a given cluster is not seen by and does not affect the other cluster(s). Alternatively, the clusters can collaborate, exchanging data over the PCIe 2.0 x16 backplane at rates of up to 5.5 GB/s (effective measured maximum data rate). Cooperative clusters communicate using Distributed MIL (DMIL), a transparent extension to the Matrox Imaging Library MIL. DMIL gives the ability to remotely access and control image capture, processing, analysis, display, and archiving from one cluster to another (Figure 2).

DMIL simplifies distributed application development by seamlessly dispatching MIL and custom commands, transferring data, sending and receiving event notifications including errors, mirroring threads, and performing function callbacks across systems. It offers low overhead and efficient bandwidth usage, and even lets slave clusters interact with one another without involving the master cluster. With the addition of DMIL, developers have in MIL a single API for heterogeneous computing across multiple systems specific to image processing.

A representative demanding imaging application
The suitability of the Matrox Supersight e2 for demanding imaging applications is best illustrated using a high-speed print inspection system as the example (Figure 3).
The inspection involves carefully comparing a captured image against a template image to identify, classify and act on the differences. The object to inspect is a 50 cm wide web of printed product cartons moving at 34 meters per second. The necessary imaging resolution and speed are achieved using a high-speed color line-scan camera with a 4096-pixel-wide sensor outputting 70,000 lines per second. Since processing is most effective on larger blocks of data – to make best use of the available memory I/O – the pixel data from the line-scan camera is organized into virtual image frames that are 4096 pixels across by 2048 lines high. This translates into an effective rate of roughly 34 frames per second (70,000 lines per second divided by 2048 lines per frame).
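The frame rate follows directly from the camera's line rate and the chosen frame height. The short Python sketch below reproduces that arithmetic from the figures quoted in the text; the variable names are illustrative only.

```python
LINE_RATE_HZ = 70_000        # lines per second output by the line-scan camera
FRAME_WIDTH_PX = 4096        # pixels per line
FRAME_HEIGHT_LINES = 2048    # lines grouped into one virtual frame

# One virtual frame is complete once FRAME_HEIGHT_LINES lines have arrived.
frames_per_second = LINE_RATE_HZ / FRAME_HEIGHT_LINES
frame_period_ms = 1000.0 / frames_per_second
pixels_per_frame = FRAME_WIDTH_PX * FRAME_HEIGHT_LINES

print(f"{frames_per_second:.2f} frames/s")
print(f"{frame_period_ms:.2f} ms per frame")
print(f"{pixels_per_frame} pixels per frame")
```

The frame period, a little over 29 ms, is the time budget within which a single cluster would have to finish all processing on one frame in order to keep up with the camera.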

The need for heterogeneous computing
Before the image analysis can begin, the raw pixel data from the camera must be corrected to account for the non-uniform intensity and vignetting across the inspected web, and transformed into proper color pixels by applying a demosaicing operation (i.e., Bayer filter). Since these steps lend themselves well to data parallel processing and are only performed once as the data is captured, they are ideally suited for a frame grabber with FPGA-based processing capabilities like the Matrox Radient eCL. Although offloading the above-mentioned functions onto the frame grabber frees up the other processing resources for other tasks, it also increases the amount of data to be transferred threefold to approximately 822 MB per second.
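The threefold increase comes from demosaicing: each raw sensor sample becomes a pixel with three color bands. The sketch below works through the data rates, assuming 8 bits per sample and taking 1 MB as 2^20 bytes; both are assumptions, since the text does not state the bit depth of the raw data or the unit convention.

```python
LINE_RATE_HZ = 70_000
FRAME_WIDTH_PX = 4096
BYTES_PER_SAMPLE = 1         # assumption: 8-bit samples
BANDS_AFTER_DEMOSAIC = 3     # red, green and blue per pixel
MIB = 2 ** 20                # assumption: "MB" means 2^20 bytes

# Raw Bayer data: one sample per pixel.
raw_bytes_per_second = LINE_RATE_HZ * FRAME_WIDTH_PX * BYTES_PER_SAMPLE
# After demosaicing: three samples per pixel.
rgb_bytes_per_second = raw_bytes_per_second * BANDS_AFTER_DEMOSAIC

print(f"raw: {raw_bytes_per_second / MIB:.1f} MiB/s")
print(f"rgb: {rgb_bytes_per_second / MIB:.1f} MiB/s")
```

Under these assumptions the color data rate comes out at about 820 MB per second, in line with the approximate figure quoted above and well within the 5.5 GB/s measured for the backplane.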
The image analysis steps required afterwards to correctly identify possible differences are described in Figure 4.

All the operations used in these steps are available in MIL and are coded to take full advantage of the different processor technologies where applicable.
Two quad-core CPUs are used for the registration, geometric pattern recognition and blob analysis operations because these operations are best suited for sequential execution. The actual application running on the two quad-core CPUs makes judicious use of multiple threads to optimally execute the above-mentioned operations while also performing system communication and control using the available cores.
In contrast, the GPU’s massively parallel architecture is ideally suited for the warping, arithmetic (i.e., subtraction), masking, binarizing, merging and closing sequence of operations.
An additional six CPU cores would have been required to perform the same task without GPU assist. The length of the sequence enhances the GPU’s effectiveness by diminishing the relative overhead associated with getting data to and from the GPU. That overhead is 5 ms for a three-band 8-bit color image of 4096 by 2048 pixels (using an ATI FirePro V8800).
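The effect of sequence length on the transfer overhead can be illustrated with a simple model: the 5 ms cost of moving a frame to and from the GPU is paid once, whatever the amount of work done while the data is there. The following Python sketch uses that fixed cost from the text; the GPU compute times are hypothetical values chosen only to show the trend, not measurements.

```python
TRANSFER_OVERHEAD_MS = 5.0   # quoted for one three-band 8-bit 4096 x 2048 image


def transfer_overhead_share(gpu_compute_ms, overhead_ms=TRANSFER_OVERHEAD_MS):
    """Fraction of the GPU round trip spent moving data instead of computing."""
    return overhead_ms / (overhead_ms + gpu_compute_ms)


# Hypothetical compute times (ms) for progressively longer operation sequences.
for compute_ms in (5.0, 20.0, 45.0):
    share = transfer_overhead_share(compute_ms)
    print(f"{compute_ms:5.1f} ms of GPU work -> {share:.0%} transfer overhead")
```

In this model a single short operation spends half its round trip on transfers, while a long chained sequence spends only a small fraction, which is why keeping the whole sequence on the GPU pays off.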

Scaling up to meet demand
The optimized partitioning of the application between the FPGA, GPU and two quad-core CPUs results in an inspection rate of 70 cartons per second, which falls well short of the target rate of 135 cartons per second. The shortfall is handled through the Matrox Supersight e2’s ability to host a second cluster comprising two quad-core CPUs and a GPU (Figure 5).
To distribute the workload, the first (master) cluster controls the image capture (i.e., the frame grabber) and processes the even frames while the second (slave) cluster processes the odd frames (Figure 6). DMIL creates both a message and a data (i.e., image) transfer portal between the two clusters. The message transfer portal carries commands, status information and results, while the data transfer portal carries the images to process and analyze.
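The cluster count and the even/odd split can be sketched in a few lines of Python. This is only an illustration of the scheduling idea, assuming throughput scales linearly with the number of clusters; the function names are hypothetical and no MIL or DMIL calls are shown.

```python
import math


def clusters_needed(target_rate, rate_per_cluster):
    """Smallest number of clusters that reaches the target rate,
    assuming throughput scales linearly with the cluster count."""
    return math.ceil(target_rate / rate_per_cluster)


def assign_cluster(frame_index, cluster_count):
    """Round-robin dispatch: with two clusters, even frames go to
    cluster 0 (master) and odd frames to cluster 1 (slave)."""
    return frame_index % cluster_count


cluster_count = clusters_needed(target_rate=135, rate_per_cluster=70)
schedule = [assign_cluster(i, cluster_count) for i in range(6)]

print(cluster_count)   # 2
print(schedule)        # [0, 1, 0, 1, 0, 1]
```

With two clusters the combined rate under this linear assumption is 140 cartons per second, slightly above the 135 cartons per second target.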
Even with substantial assistance from the GPU and a dual-cluster arrangement, the CPU cores are used 70% on average, which follows the good practice of keeping some CPU power in reserve for content-dependent processing workloads. To summarize, the required throughput could not be achieved without offloading and accelerating some of the processing onto the GPU and, more importantly, without having two compute clusters connected through the very high-speed and efficient PCIe links implemented on the Matrox Supersight e2’s backplane.

Meeting the input and computational requirements of demanding imaging applications requires different processing technologies working together in an optimized and scalable hardware platform and software environment. Together, the Matrox Supersight e2 and the Matrox Imaging Library MIL form such a platform and environment: they support multi-core CPU, GPU and FPGA processing technologies in combination and replicated within the same form factor to provide processing scalability now and into the future. The Matrox Imaging Vision Squad, a team dedicated to helping developers with application feasibility, best strategy and prototyping, can help select the optimal processing technologies and configuration for any given job.

Hall 4, Booth 4C15

Johann-G.Gutenberg-Str. 20
D-82140 Olching

Tel 0 81 42 / 4 48 41-0
Fax 0 81 42 / 4 48 41-90

E-Mail: info@rauscher.de


RAUSCHER GmbH Systemberatung für Computer und angewandte Grafik
Ernst Rauscher