Edge AI System Design & Performance Analysis Series: Part 1
CPUs vs GPUs: Contextual Power vs Brute Power
1. Abstract
Designers looking to deploy new AI-based solutions sometimes gravitate towards workload-specific hardware without proper insight into contextual parameters, such as system design, performance analysis, and the metrics needed to justify that predisposition.
With recent and upcoming improvements across the AI stack, from hardware to AI models, designers may miss out on superior overall performance, cost savings, reduced complexity, repurposability, and ease of scaling if decisions are made on belief bias (more on that below) rather than on metrics. This paper compares Edge-deployed video AI workload performance between CPU-based and GPU-based hardware platforms. Specifically, the Part 1 benchmark tests compare Intel’s i5 and i7 processors to Nvidia’s Xavier NX and AGX GPUs across a variety of the AI and AI-related workloads most commonly demanded by enterprises.
The results indicate that CPU-focused hardware, with (i) its ability to respond to heterogeneous AI workloads effectively, (ii) better support and contextual power for co-located non-AI pipelines and applications, and (iii) a generally lower total cost of ownership, can pull ahead in overall performance against more specialized GPU-focused hardware.
2. Introduction
Belief Bias (credit: APA Dictionary of Psychology): The tendency to be influenced by one’s knowledge about the world in evaluating conclusions and to accept them as true because they are believable rather than because they are logically valid.
One prevalent belief bias among deep-learning system designers is that GPU-focused hardware is an automatically superior choice for any video AI workload. Both CPU-based and GPU-based hardware platforms demonstrate their own unique strengths when contributing to the overall solution. However, both vendors and consumers may reflexively choose GPU-based offerings without sufficient system design analysis and performance facts, leading to suboptimal solutions that end up overpriced, more complex, and less adaptable to the changing AI landscape.
This article is an initiative to provide open-book Edge AI system design analysis and benchmarking of off-the-shelf Edge hardware solutions, using a video AI workload and factors that matter to typical retail-based end users.
3. Benchmarking
3.1 Hardware Selection
The choice of hardware for a video AI system is very subjective, since it is highly dependent on a variety of end-product functionality requirements, including (but not limited to):
- Use cases/types of tasks
- Accuracy requirements and generalization level for the AI solution
- Choice of algorithms and models
- Choice of model parameters
- Design and composition of pipeline
- Deployment strategy
- Requirements for throughput, latency, availability, tolerance, etc.
- TCO: CAPEX & OPEX
- Solution optimizations (e.g., code level, design level)
- Deployment environment, conditions (e.g., constrained, or unconstrained)
- Density of scene or the ‘busy’ nature of a scene
- Occlusion level
- Size and deformation of objects to be inferenced
- Distance from camera
- Lighting, etc.
Therefore, readers should be mindful of what was benchmarked for this article and how, and of how the performance results hereunder might apply to their future Edge-deployed video AI designs.
3.2 Benchmarking Purpose
As end users become increasingly aware of how AI can solve some of their enterprise requirements, the number of AI-driven solutions deployed across enterprises has grown significantly. Hardware and software stack options have proliferated into the hundreds, to meet this escalating demand.
However, the growing volume of choices in the AI market makes decision-making even more difficult, especially since universal benchmarks for like-to-like comparisons are generally unavailable. The widely varying performance requirements of each customer project, along with vendors’ system design choices, also make uniform comparisons between hardware architectures mostly impossible or inconclusive.
Further confusion can arise because the benchmarks published by OEMs and chipmakers:
- Might not be general enough to predict performance for your specific project.
- Might not cover a wide enough picture, or enough detail, to serve as a reference for predicting end-to-end performance.
- Might not include data relevant to assessing real-world performance and practical usage.
- Might not be completely relevant to the design of your project’s Edge AI system.
The benchmark tests performed for this article were aimed at illuminating the best options for Edge-deployed AI applications among leading commercially available hardware, within the scope defined in the following section. Our goal is to provide a thoughtfully crafted reference that guides designers’ intuition towards Edge AI system and pipeline analysis, the design choices to consider for a performant and TCO-efficient system, and its benchmarking, without misleading or confusing ‘marketing’ metrics from OEMs and vendors.
3.3 Benchmarking: Key Decisions
3.3.1 Benchmarking Generalizable Pipelines
Our goal of creating a useful and inclusive benchmark was based on finding the most popular use cases applicable to a diverse cross-section of enterprises. This included making sure that the chosen cases could reasonably be deployed in a diverse range of environments without requiring substantial AI modifications.
People counting and people dwelling were determined to be at the core of a wide range of higher-level analytics applications, including consumer insights, safety and security, automation of SOPs in retail, banking and education, warehousing, and many others. Therefore, these use cases appear ideal for our benchmarking requirements because they represent very typical core features found across a great many customer solutions.
Among the key decision factors in constructing a pipeline for any use case is the trade-off between computational complexity and accuracy versus economic gain. An analysis of our past few years of experience in the AI field suggests that deep-learning generalizability is more often achieved with relatively heavier models than with tiny or lighter models.
Generalizability is a key measure of how well a solution is usable across a wide range of enterprises and industries, as well as for diverse deployment environments within a single enterprise. Therefore, we adopted this as a cornerstone principle in our benchmarking use-case selection.
Note: Most marketing metrics from hardware OEMs focus on high throughput numbers from lighter models without explicitly calling out the models’ details. Such lighter models do not generalize well, i.e., their real-world accuracy is poor for most cases, which can be highly misleading to customers who lack technical expertise.
Our benchmark focuses on productionable ‘use cases,’ i.e., the end-to-end pipeline and system responsible for last-mile end-value creation, while attempting to find the middle ground for an economical but still generalizable solution. We deliberately filtered out lighter models that might offer better throughput on paper but return unusably poor accuracy or lack the generalization capability to apply the AI across reasonably diverse environments, verticals, and enterprises.
3.3.2 Edge Deployment Strategy
This section explains why Edge deployment is an ideal choice for our customers.
Simplicity and Cost
Video AI is among the most computationally intensive AI workloads because video is inherently heavy to transmit, store, and process. Creating pipelines for encoding, decoding, streaming, etcetera, as well as building out a resilient, large-scale centralized system to handle AI computing tasks, adds further complexity and cost. Furthermore, complexity and cost increase non-linearly with the size of the data to be processed/inferenced centrally. It is also highly challenging to predict and plan ahead for the infrastructure requirements and associated costs of a centralized AI architecture.
Video AI running at a central data centre or cloud requires deep expertise across multiple disciplines, even for just a few hundred to a few thousand camera feeds. Smaller centralized AI deployments are still significantly more complex and expensive than an equivalent Edge-based solution. Many of the most capable enterprises still lack the expertise or clarity required to source the right stack or to provision and maintain a central AI installation. Companies can become lost or mired in the technical complexity rather than remaining focused on their core business value and ROI to be gained from AI. Any resulting missteps in the initial planning of a centralized AI system can become an expensive detour.
Our experience suggests that the following benefits of Edge AI are crucial during the AI adoption and transition phases (from adoption through to scaling AI deployments):
- Rapid turnaround cycles to adopt the new AI technology
- Quick financial return of the initial investment (ROI)
- Flexibility of available solutions
- Simplicity of design and deployment
- Predictable overall and incremental costs (capital expense and operating expense)
Edge-deployed AI offers all these benefits, as well as being more independent and standalone, which reduces its dependency on existing infrastructure and processes at enterprise sites. It is also more forgiving of missteps, since costs scale linearly, flexibility is retained, and there is no undue complexity.
Improved Security and Privacy
Today’s enterprises recognize that their data holds competitive business advantage and hence strive to protect this valuable asset from 3rd-party access. Since Edge AI moves AI computation closer to where the data is generated and/or stored, the data never has to travel outside secure Edge networks. Processed Edge AI metadata can then be encrypted, transmitted, shared, collated, and collaborated on, to achieve end business objectives, without compromising its privacy or trust during transfer over 3rd party communication networks.
Other Advantages
- Low latency, which leads to more responsive and instantaneous (mission-critical, real-time) processing.
- No-to-low bandwidth requirements between Edge sites and a centralized control authority.
- Federated (independent, autonomous) deployments provide decentralized and offline operational capability.
- Energy efficiency, for both computing power and cooling requirements.
- Ease of maintenance, with form factors ranging from thumb drives to small industrial-grade computers, to full-size tower machines.
- Flexible deployment locations, including challenging environmental conditions and remote areas with poor maintenance access.
- When paired with Edge node remote-management capabilities, installations are beginning to be categorized as ‘deploy and forget.’
3.3.3 General Compute Hardware TCO
Realizing the Best Total Cost of Ownership from Compute Hardware
Enterprises of all sizes invest in either lean (distributed) or large, dedicated (centralized) compute infrastructure to handle several essential functions of their operations. For example, consider all the POS systems of a retail brand that are networked across hundreds or thousands of locations, as well as an enterprise’s CCTV infrastructure.
For most enterprises, the majority of compute infrastructure is strictly CPU-driven. Typically, the older and larger an enterprise is, the larger its investment in general purpose CPU compute engines. Their use of purpose-built processors alongside the CPUs, such as FPGAs or GPUs, is low or nil in comparison to their general-purpose CPU-only deployments.
Adoption of new purpose-built processors requires major upgrades to the rest of the hardware stack as well, resulting in a major cost hit to an enterprise’s existing compute infrastructure. This is a huge challenge for enterprises, since adopting AI demands that they complete expensive upgrades before they can begin to realize any return on that investment. For industries operating in a competitive market or on thin margins, this can add considerably to their infrastructure’s cost basis, during both the AI adoption stage and when scaling up AI capabilities.
Even when enterprises have GPU infrastructure components, general-purpose CPUs of similar compute capability are still required to get the most performance out of the GPUs. By leveraging these CPUs to augment overall AI workload processing, enterprises can save a lot on total infrastructure capital expenditure and operating expenses.
We are not attempting to say that purpose-built hardware, such as GPUs focused on AI workloads, is not better at AI inferencing than its general-purpose CPU counterparts. Instead, we are focusing on the facts that:
- Enterprises may not be confident in taking the risk of implementing dedicated AI infrastructure when they are not sure about their ROI.
- The barrier to entry into AI is much lower for vendors and enterprises when there is already a ready supply of general-purpose CPU compute power in existing systems that can be utilized.
- As enterprises scale up their existing GPU-based AI capabilities, it may be more efficient and cost effective to enlist their existing general-purpose CPUs to tackle the additional AI inference workloads.
Enterprises with a deployment of capable general-purpose CPUs have a huge advantage in beginning their AI journey, or in maximizing cost efficiency when scaling existing AI infrastructure. The picture becomes clear that general-purpose CPU compute power is irreplaceable in any AI deployment.
Flexibility & Ability to Repurpose
Pipelines for end-to-end AI solutions include several other functional blocks besides their key AI inference blocks, including operational support for databases and data lakes, business/application logic, multimedia support, and other software infrastructure support. These other required blocks are often only available on general-purpose CPUs or will generally run faster there than on a GPU.
Therefore, instead of focusing on just the AI-inferencing performance benchmarks that chip makers often cite, we emphasize a more complete picture by benchmarking all-inclusive, end-to-end system performance. This complete benchmark will also capture dependencies and bottlenecks across the entire pipeline to measure whether or not the ever-increasing throughputs on specialty hardware actually translate into a real-world performance impact against CPU-only architectures.
A general-compute CPU-only infrastructure also provides the agility and flexibility to:
- Co-host AI workloads with all other workload types on existing, shared IT Infrastructure (e.g., POS machines for retail shipped with video AI for counting people and providing autonomous store premises security).
- Repurpose, based on demand, the general compute infrastructure dedicated to AI versus other enterprise IT functionality. With a general-purpose compute architecture, the infrastructure is not locked into a specific workload but can instead be shared, provisioned on demand, and repurposed as needed for maximum performance.
4. Benchmark Framework
The following tables explain the overall framework and scope of our benchmarking (across the entire series).
4.1 Scope of the Benchmark
4.2 Individual Benchmark and Pipeline Benchmark
Since the primary focus of the benchmark is to identify the best value commodity hardware to run Edge-deployed AI workloads, we performed benchmarks using two approaches:
- Individual module benchmark: The objective is to measure the overall throughput of the module while it makes use of the entire computational power of the selected hardware. This helps us understand how well the hardware under test performs on each individual module/task.
- Module as part of pipeline benchmark: The objective is to measure the performance of the entire pipeline, where the module is only a part of, rather than the focus of, the throughput testing. In this case, the performance of the module depends on external factors such as overall pipeline performance, settings, restricted use, etc. This helps us understand how well the hardware under test handles the entire pipeline rather than just one module. This is a more realistic setting for real-world use cases.
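As a minimal illustration of the difference between the two approaches, the Python sketch below (the module and stage callables are hypothetical placeholders, not our actual benchmark harness) measures a module’s throughput in isolation and then as one stage of a pipeline:

import time

def measure_standalone(module, frames):
    # Individual module benchmark: the module receives frames as fast as it can
    # consume them, so the result reflects the hardware's peak for this one task.
    start = time.perf_counter()
    for frame in frames:
        module(frame)
    return len(frames) / (time.perf_counter() - start)

def measure_in_pipeline(stages, frames):
    # Pipeline benchmark: each frame passes through every stage in order, so the
    # measured FPS reflects the slowest stage and inter-stage dependencies.
    start = time.perf_counter()
    for frame in frames:
        data = frame
        for stage in stages:
            data = stage(data)
    return len(frames) / (time.perf_counter() - start)

In the standalone case, the module’s throughput is limited only by the hardware; in the pipeline case, it is also limited by whatever share of the hardware the other stages leave available.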
4.3 Benchmark Pipeline Composition
4.4 Benchmark Workload Type and Data Collected
4.5 Benchmark Data to be Measured
5. Individual Benchmarking of Modules
5.1 Decoding — Individual
Production live / real-time video analytics pipelines consume video from two types of sources:
1. Directly from the camera hardware over the network.
To save network bandwidth and for other cost and practical reasons, cameras encode live feeds into a few commonly known video codings, such as H.264 or H.265, before transmitting the video stream over the network using application-level protocols. The most commonly used network protocol for live video streaming in CCTV networks is RTSP.
2. From an intermediate system such as a VMS (Video Management Server/System).
Depending on the proximity of the video analytics system to the VMS, and on the overall architecture, a live video stream can be ingested from a VMS via the following mechanisms:
- Restreaming: Retransmits/restreams live video feeds received from the camera as-is, without any processing.
- Live transcoding: Video received from the camera is transcoded in real time and streamed for 3rd-party consumption. E.g., decoding a live video feed and streaming the decoded feed to other systems, or decoding and re-encoding into another compression format and streaming it for consumption by other systems.
For end-to-end benchmark purposes, we also benchmark transcoding, i.e., a live encoded feed from the source is ingested, and our pipeline takes responsibility for decoding it before it is consumed by the AI systems.
Decoding is the first and essential gateway block for performing Video AI.
Incoming Video Stream Properties: H.264 encoding, 25 FPS, 1080p resolution
5.1.1 Background
Intel Plugins used:
vaapih264dec (for decoding) and vaapipostproc (for resizing)
sample gst launch command:
gst-launch-1.0 -v rtspsrc location=${VIDEOURL} tcp-timeout=${TIMEOUT} timeout=${TIMEOUT} protocols=tcp ! rtph264depay ! h264parse ! vaapih264dec ! vaapipostproc force-aspect-ratio=false scale-method=fast width=${WIDTH} height=${HEIGHT} ! "video/x-raw(memory:VASurface)" ! appsink sync=false
Nvidia Plugins used:
omxh264dec (for decoding) and nvvidconv (for resizing)
sample gst launch command:
gst-launch-1.0 -v rtspsrc location=${VIDEOURL} latency=0 ! rtph264depay ! h264parse ! omxh264dec ! nvvidconv interpolation-method=5 ! 'video/x-raw, width=${WIDTH}, height=${HEIGHT}, format=BGRx' ! videoconvert ! appsink sync=false
5.1.2 Observations
Video decoding on the Intel 10th Generation i7 NUC produces better throughput than on the Nvidia Xavier NX. With this Intel CPU, we were able to decode up to 23 channels in real time and up to 25 channels with a minor delay of 1–2 FPS across all cameras. However, the Nvidia GPU was only able to decode 12 channels in real time and added delays of up to 80% with 15 and 20 channels.
5.1.3 Decision
Decoding module standalone hardware of choice: Intel 10th Generation i7 NUC
5.2 AI Inference 1: Detection — Individual
Detection is among the few AI building blocks that are fundamental and at the forefront of most Video AI pipelines. Detection is usually accompanied by pre- and post-processing blocks to improve accuracy or remove noise from intermediate raw detection outputs. Currently, DL models dominate detection inference accuracy and throughput benchmarks.
Each detection network works with a fixed set of input resolution options unless customized. The larger the input resolution, the higher the accuracy can be, especially for smaller objects (which also helps performance on objects that are far from the camera). However, this makes inference slower. Resizing can be one of the pre-processing blocks.
Another factor that impacts throughput and accuracy is quantization, where quantized models trade off a small amount of generalization to achieve a far larger gain in overall throughput.
Please refer to Annexure 9 for more detailed information on system design, design decisions, and their impact on performance.
Incoming Video Stream Properties: H.264 encoding, 15 FPS, 1080p resolution, 15 objects in scene
5.2.1 Background
The object detector used for our benchmark is a customized network built on the reference Yolo V4. Our customization (not relevant in the current context) concentrated on achieving higher mAP and better generalization for classes relevant to our focus markets (e.g., retail, banking, smart cities) while still maintaining throughput similar to the reference Yolo V4.
The model used for benchmarking is mixed precision (f16-f32-i64). In the converted model, all the convolution layers are in FP16. The original FP32 model was converted to this mixed-precision model.
The preprocessing that is part of our detection block is resizing; the post-processing used is Non-Maximal Suppression (NMS) and a confidence-threshold-based filter.
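A minimal Python sketch of this post-processing stage, assuming raw detections are represented as (x1, y1, x2, y2, score) tuples (the thresholds shown are illustrative defaults, not our production values):

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2, score).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def postprocess(detections, conf_thresh=0.4, iou_thresh=0.5):
    # Confidence-threshold filter: drop low-scoring raw detections.
    dets = sorted((d for d in detections if d[4] >= conf_thresh),
                  key=lambda d: d[4], reverse=True)
    # Greedy Non-Maximal Suppression: keep the highest-scoring box and discard
    # any remaining box that overlaps a kept box beyond the IoU threshold.
    kept = []
    for d in dets:
        if all(iou(d, k) < iou_thresh for k in kept):
            kept.append(d)
    return kept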
5.2.2 Observations
On the Intel 10th Generation i7 NUC, we were able to achieve 15 FPS for object detection inferencing across CPU- and integrated-GPU-based inferencing. Thus, if we allocate 2 FPS per camera for object detection inferencing, per the Appendix section “Decision 1: Determining FPS for Object Detection”, we can inference 7 cameras in real time. On the Nvidia Xavier NX, we were able to achieve up to 18.80 FPS on its GPU.
Parallel batching from up to 4 simultaneous cameras produced the best inference throughput.
We also attempted activating DLA + GPU inferencing, and to our surprise, the throughput was not better than GPU-only inferencing, due to limitations on DLA inferencing at the time of benchmarking.
If we allocate 2 FPS per camera for object detection inferencing, according to Appendix section “Decision 1: Determining FPS for Object Detection“, we can inference 9 cameras in real time on Nvidia Xavier NX.
5.2.3 Decision
AI Inference 1 module object detection standalone hardware of choice: Nvidia Xavier NX
5.3 AI Inference 2: Tracking — Individual
Multi-object tracking is the second most used AI building block in a Video AI pipeline. In our experience, and through an internal study of 400+ AI use cases across 15+ industries, the object detection and multi-object tracking building blocks alone can solve 70–90% of the use cases that arise, and at least 90% of use cases require both the object detection and multi-object tracking building blocks to be present.
Object detection is a brutally concurrent, stateless computation. Multi-object tracking, by contrast, is highly stateful and sequential, operating on data generated over a time-step-synchronized, short spatio-temporal window. It involves complex pattern-recognition logic with non-trivial merges and branches over a lot of auxiliary data. Even though parts of it can be parallelized, bottlenecks arise in the parts that cannot be, which trades away the possible gains, e.g., when the current frame’s prediction depends on the last “N” frames’ predictions, or when the search-and-match space for various objects gets large, especially in the spatial (current frame) and temporal (feedback to correct IDs) dimensions.
Hence, this area of computer vision is not completely dominated by GPUs. On the contrary, many multi-object tracking architectures rely solely or primarily on CPUs, perform on par with or better than GPU-only counterparts, and still deliver high prediction throughput compared to pure GPU-based architectures.
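To illustrate why this workload is inherently stateful and sequential, here is a deliberately simplified Python sketch (a greedy nearest-centroid tracker, far simpler than production trackers such as ours): the result for each frame depends on the track state carried over from previous frames, so frames cannot simply be fanned out across parallel workers.

import math

class SimpleTracker:
    # Minimal illustrative multi-object tracker: self.tracks is state carried
    # from frame to frame, which is what makes the computation sequential.
    def __init__(self, max_dist=80.0):
        self.tracks = {}     # track_id -> last known centroid (x, y)
        self.next_id = 0
        self.max_dist = max_dist

    def update(self, centroids):
        # centroids: (x, y) object centres detected in the current frame.
        assigned, unmatched = {}, list(centroids)
        for track_id, last in self.tracks.items():
            # Sequentially match each existing track to its closest detection.
            best, best_d = None, self.max_dist
            for c in unmatched:
                d = math.dist(last, c)
                if d < best_d:
                    best, best_d = c, d
            if best is not None:
                assigned[track_id] = best
                unmatched.remove(best)
        for c in unmatched:              # unmatched detections start new IDs
            assigned[self.next_id] = c
            self.next_id += 1
        self.tracks = assigned           # this frame's output feeds the next frame
        return assigned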
Please refer to Annexure 9 for more detailed information on system design, design decisions, and their impact on performance.
Incoming Video Stream Properties: H.264 encoding, 25 FPS, 1080p resolution
5.3.1 Observations
The input video stream consistently had approximately 15 objects in the field of view.
As outlined in Appendix section “Use case: People Counting“, due to its stateful sequential workload, the Intel 10th Generation i7 NUC significantly outperforms the Nvidia Xavier NX. If we allocate 15 FPS per camera for multi-object tracking, per Appendix section “Decision 2: Determining FPS for Multi-object Tracking“, we can inference on 8 live video cameras on the Intel 10th Generation i7 NUC, but only 3–4 live video cameras on the Nvidia Xavier NX.
5.3.2 Decision
AI Inference 2 module object tracking standalone hardware of choice: Intel 10th Generation i7 NUC
5.4 Application Logic — Individual
The application logic operates on frame metadata generated by the various building blocks in the pipeline. It does not require the image frame itself for processing. These application logics are a combination of sequential and parallel computation.
Please refer to Annexure 9 for more detailed information on system design, design decisions, and their impact on performance.
Incoming Video Stream Properties: 1000 FPS, 10–20 objects in frame
5.4.1 Observations
The input video stream consistently had approximately 14–15 objects in the field of view.
Due to its stateful sequential workload, the Intel 10th Generation i7 NUC significantly outperforms the Nvidia Xavier NX, primarily due to the Intel platform’s superior CPU.
The application logic testing includes database operations, filters on streaming data, stream processing, and stateful sum of all object counts on individual frames in chronological order.
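A minimal Python sketch of this kind of metadata-only application logic (the field names and confidence filter are illustrative assumptions, not our production schema):

def running_object_count(frame_metadata, min_confidence=0.5):
    # Operates purely on per-frame metadata; no image frames are needed.
    # Stateful: the cumulative total must be built up in chronological order.
    total = 0
    for frame in sorted(frame_metadata, key=lambda f: f["timestamp"]):
        # Streaming filter: keep only sufficiently confident detections.
        objects = [o for o in frame["objects"] if o["confidence"] >= min_confidence]
        total += len(objects)
        yield frame["timestamp"], total   # e.g., persisted to a database per frame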
5.4.2 Decision
Application logic module standalone hardware of choice: Intel 10th Generation i7 NUC
6. Overall Pipeline Benchmarks
6.1 Intel 10th Generation i7 NUC
6.1.1 Summary
Ability to Run Reference Pipeline
We were able to run all of the following blocks of the reference pipeline workload on the Intel 10th generation i7 NUC:
- 5x cameras of real-time people counting use case
- 5x cameras video recording
- 1x camera live AI result/video streaming
- Evidence management system instance
- Configuration, maintenance, and dashboard tool instances.
System Utilization
For the above workload, we observed the overall utilization as:
- CPU: 9.5–10 vCPUs out of 12 vCPUs
- GPU: 95–100%
Stability
At ~80% of utilization, the system can be stable for prolonged periods of time, even at the workloads listed above.
6.2 Intel 11th Generation i5 NUC
6.2.1 Summary
Ability to Run Reference Pipeline
We were able to run all of the following blocks of the reference pipeline workload on the Intel 11th generation i5 NUC:
- 3x cameras of real-time people counting use case
- 3x cameras video recording
- 1x camera live AI result visualization video streaming
- Evidence management system instance
- Configuration, maintenance, and dashboard tool instances
System Utilization
For the above workload, we observed the overall utilization as:
- CPU: up to 6.9–7 vCPUs out of 8 vCPUs
- GPU: 95–100%
Stability
At ~80% of utilization, the system can be stable for prolonged periods of time, even at the workloads listed above.
6.3 Nvidia Xavier NX
6.3.1 Summary
Ability to Run Reference Pipeline
We were able to run only partial blocks of the reference pipeline workload on the Nvidia Xavier NX:
- 3x cameras of real-time people counting use case
- 3x cameras video recording
- 1x camera Live AI result visualization video streaming
- Evidence management system instance
We were unable to run the remainder of the workload due to lack of resources on the Nvidia hardware.
System Utilization
For the above workload, we observed the overall utilization as:
- CPU: up to 5.5 vCPUs out of 6 vCPUs
- GPU: 92% utilization
Stability
The overall utilization is too close to the hardware capacity limit. To achieve a realistic operational scenario, the system would need to retain at least 1 available vCPU to be stable for prolonged operation, even with just the workloads listed above.
6.4 Nvidia Xavier AGX
6.4.1 Summary
Ability to Run Reference Pipeline
We were able to run all of the following blocks of the reference pipeline workload on the Nvidia Xavier AGX:
- 4x cameras of real-time people counting use case
- 4x cameras video recording
- 1x camera live AI result visualization video streaming
- Evidence management system instance
- Configuration, maintenance, and dashboard tool instances
System Utilization
For the above workload, we observed the overall utilization as:
- CPU: up to 6.9 vCPUs out of 8 vCPUs
- GPU: 96% utilization
Stability
At ~80% of utilization, the system can be stable for prolonged periods of time, even at the workloads listed above.
6.5 Hardware of Choice
Evidence from the above stages of performance analysis suggests that:
End-to-end reference pipeline execution hardware of choice: Intel 10th Generation i7 NUC
AI solutions are not just about a bunch of models’ throughput numbers, but rather about inclusive performance across all the heterogeneous tasks that form a complete AI solution capable of solving a problem end to end.
Intel’s performance, and this conclusion, come on the back of strong CPU and integrated-GPU performance across the all-round heterogeneous tasks that form a complete AI solution, rather than peak performance in only one type of computation or stage. Some of this ecosystem of AI-allied tasks can be complex and unpredictable to pre-design or bake optimizations for. They require agility and the ability to handle heterogeneous workloads effectively, rather than focusing on the single task of AI inferencing, and that is where Intel’s hardware shines.
7. Total Cost of Ownership Analysis
Total Cost of Ownership (TCO) is a critical factor to consider in investments made for AI workloads. It can be more critical than for any other type of workload because AI inference workloads are compute-intensive and AI infrastructure continues to be expensive. This expense is further amplified at scale.
If the TCO of your infrastructure is not understood and managed well, both at the time of purchase and during continued operation, the AI innovation may not end up being worth the capital expense outlay, despite having unprecedented business value.
This section analyses the overall TCO of various Edge-AI Hardware, to understand the actual costs without overlooking the hidden cost items that can present themselves later on.
Some of the key characteristics that form decision points in calculating Edge-AI infrastructure Total Cost of Ownership are as follows:
Objective
- Hardware costs for the end-to-end solution
Subjective
- Cost of repurposing: How much flexibility the solution offers while running various different kinds of workloads and architectures
- Future proofing: Will the compute solution today have sufficient performance to run tomorrow’s requirements
Expandability
- Ability to support non-AI workloads in parallel with AI workloads
- How expensive or flexible it will be to expand current capabilities
7.1 Objective Total Cost of Ownership Analysis
For the benchmarking purposes described in this article, both the Intel i5 and Intel i7 considered were 8 GB RAM variants, the Nvidia Xavier NX was the 8 GB RAM variant, and the Nvidia Xavier AGX was the 32 GB RAM variant.
7.1.1 Intel i7 NUCs
7.1.2 Intel i5 NUCs
7.1.3 Nvidia Xavier NX
7.1.4 Nvidia Xavier AGX
8. Conclusion
After the ‘Contextual Power’ Intel CPU and ‘Brute Power’ Nvidia GPU hardware were put to subjective and objective benchmark evaluations (under the conditions outlined in earlier sections), it is uplifting to see that out-of-the-box Intel Edge CPUs perform strongly against the Edge Nvidia GPUs on the given Edge-AI tasks. The top ranking of the Intel CPUs becomes even more apparent when the Total Cost of Ownership and their performance on both AI and non-AI workloads are included.
The key lesson for vendors is: It is not in a company’s best interest to rely on the TOPS or TFLOPS performance measurements as the single most important data point in assessing AI processing potential of any hardware. A belief bias in this direction undermines achieving a realistic understanding of the best Edge-AI solution design and performance.
Most of what has been presented in this article is not new to experienced Edge-AI developers. Instead, it stands as a guide to help end users and business management avoid the pitfalls of focusing on chipmakers’ marketing numbers to the exclusion of the broader picture. Decisions that consider the totality of the system, including the choice and analysis of architecture and design and the performance requirements of its various components, will likely perform better, remain more future-proof, and achieve the best returns on capital and operational expenses.
9. Appendix
9.1 Benchmark Setting Details
9.1.1 Scene Change Building Block Forms Core of Motion Recognition
Scene Change Recognition can recognize movement in the scene, changes in the scene due to blur, camera position changes, disconnection of the input feed, occlusion of the lens, abnormal light thrown on the lens, drastic changes in the lighting conditions, etcetera. All these recognition routines are configurable.
Scene Change Building Block Used in Pipeline Has a Dual Role
- Standalone application: Provides customers with a fundamental stand-alone filter for motion events and scene-change events. This use case is considered essential for most end-user cases. Hence, it is important that it be able to run even on basic hardware with very minimal computational footprint.
- As a pipeline building block: Using several spatio-temporal similarity scores, this building block allows us to optimize the AI pipeline for several different use cases, to get better throughput and better hardware utilization based on scene dynamics, all while making sure these optimizations do not impact the native accuracy of the pipeline.
Important Note: For the purposes of benchmarking, this building block’s duality property (i.e., use of it within a pipeline) is turned ‘off’, since keeping it active would allow the entire benchmark to become highly subjective and dynamic towards the dataset being used for evaluation. This would provide an opportunity for vendors to bias their benchmarks by selecting a favorable dataset (more on that below). However, this duality property can be utilized in the field to get additional performance out of deployed Edge-AI hardware.
Emphasizing How Motion Recognition is Used for Optimizing the AI Pipeline
AI models and algorithms are extremely computationally expensive: the open-source, better-generalizable, state-of-the-art models for object detection run at 15–35 FPS on top-end AI inference-class GPUs for datacentres. This helps us understand how computationally taxing it would be to run similar, well-generalizing models on Edge deployments.
For this article, we did not want to compromise accuracy and generalization in return for maximum throughput. Instead, we envisioned a way to find a balance of both.
When provisioning for real-world Edge applications, it is important to focus on the end goal of providing a generalized pipeline with relatively generalized building blocks that can be used in a more relaxed manner and encompass a wider range of use-cases. However, state-of-the-art models are important for maximum real-world performance.
For the scenario described below, the motion recognition building block plays a key role in extracting higher overall performance from Edge-AI hardware by utilizing the hardware only for appropriate frames. This makes it possible to inference using state-of-the-art AI models that still generalize well on less-powerful Edge hardware.
Whenever an input video source is highly redundant, containing a large amount of temporal similarity (similarity across frames) and spatial similarity (similarity in appearance in a scene), the motion recognition building block quantifies such similarity between subsequent series of frames by assigning it a score. This allows the pipeline to autonomously decide whether to run a frame or sequence of frames through a computationally complex, state of the art, generalizable AI model, or on a less computationally intensive model/algorithm.
If subsequent frames have a high-variance / low-similarity score from the scene change recognition block, then those frames are inferenced by state of the art, generalizable AI models. These are called landmark inferences.
If subsequent frames have a low-variance / high-similarity score from the scene change recognition block, then those frames are inferenced by less computationally intensive AI models and the results are compared with previous landmark inferences from the state of the art, generalizable AI models. This comparison, to a larger extent, remains logically ‘true’ if the similarity scores of those continuous sequence frames rank at or above the developer’s set threshold.
For low-variance video sequences, frames are processed in one of two ways:
Accuracy is comparable to the last landmark inference:
If the inference accuracy of the computationally less-intensive AI model is comparable to that of the last landmark inference from state of the art, generalizable AI model, then the previous steps repeat until one of the following occurs:
- The similarity score across subsequent frames changes beyond a set threshold.
- The inference accuracy from the computationally less-intensive AI model begins offsetting inference predictions from the last landmark inference by a sufficiently large margin (developer set).
Accuracy is not comparable to the last landmark inference:
If the inference accuracy of the computationally less-intensive AI model is not comparable to that of the last landmark inference, then the inference of the frame in question (i.e., one that was inferenced by the computationally less-intensive AI model) is sent to the state of the art, generalizable AI model and this inference is set as the landmark inference. Then, the previous cycle continues. This approach drastically reduces computation over each frame for sparse scenes or scenes without many changes while still retaining high inferencing accuracy.
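A minimal Python sketch of this decision loop, assuming the similarity, heavy_model, light_model, and drift callables and both thresholds are illustrative placeholders standing in for the scene-change block, the generalizable model, the lightweight model, and the developer-set thresholds (not our actual implementation):

def run_optimized_pipeline(frames, similarity, heavy_model, light_model,
                           sim_thresh=0.9, drift_thresh=0.2):
    landmark, prev = None, None
    for frame in frames:
        score = similarity(prev, frame) if prev is not None else 0.0
        if landmark is None or score < sim_thresh:
            # Low similarity / high variance: run the heavy, generalizable model.
            landmark = heavy_model(frame)          # landmark inference
            result = landmark
        else:
            # High similarity / low variance: run the cheaper model and compare
            # its output against the last landmark inference.
            result = light_model(frame)
            if drift(result, landmark) > drift_thresh:
                # Output has drifted too far: refresh the landmark inference.
                landmark = heavy_model(frame)
                result = landmark
        prev = frame
        yield result

def drift(current, landmark):
    # Illustrative divergence measure: relative difference in object counts.
    return abs(len(current) - len(landmark)) / max(len(landmark), 1)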
9.1.2 Use case: People Counting
People counting / object counting is among the most popular use cases across different enterprises and enterprise verticals. It is often employed to count people within a zone, crossing a line, or to count objects that match a configured set of attributes (e.g., male, female, cartons, vehicles proceeding in a particular direction, etcetera).
Each type of counting can be done either per-frame (stateless) or across frames (stateful):
- Per-frame counting (stateless): This approach is considered ‘stateless’ since it must count unique objects within a single scene (i.e., the number of unique detections in the frame). This case does not involve tracking an object across multiple frames. This is the simplest pipeline to implement but often cannot cover the spectrum of insights that end users are expecting to get from their AI systems, due to the lack of object tracking across scenes / frames.
- Counting across frames (stateful): This approach is considered ‘stateful’ since it must count unique objects across a sequence of frames. This is possible by identifying an object and assigning it a unique ID in the initial frame(s) and then matching its presence across subsequent frames and assigning the same ID. Multiple unique objects can be tracked in this manner using a multi-object tracker model. This adds complexity compared to simple per-frame counting but is essential to cover the wide spectrum of insights that end users expect from their AI solution.
This benchmark includes the people/object counting across frames (stateful) use case. Therefore, both the Object Detection and Multi-object Tracker building blocks are utilized.
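The distinction between the two counting modes reduces to the following Python sketch, assuming the multi-object tracker has already assigned a track ID to every detection:

def count_per_frame(detections):
    # Stateless: the count is just the number of detections in this one frame.
    return len(detections)

def count_across_frames(frames_of_track_ids):
    # Stateful: count unique track IDs seen over the whole sequence, so a person
    # who appears in many frames is still counted only once.
    seen = set()
    for track_ids in frames_of_track_ids:
        seen.update(track_ids)
    return len(seen)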
9.1.3 Object Detection and Multi-object Tracking
A decoded video feed is provided as the input to the object detection building block. Object detection localizes the presence of objects and their respective object types in a scene. It outputs bounding boxes over each object and returns each box’s coordinates and the object’s class.
All of the object detection results (i.e., all of the bounding boxes / ROIs over each object, object classes and coordinates) become inputs for the multi-object tracker block. Hence, it is the object detection’s consistent accuracy that drives the overall accuracy of the system.
If the object detector is sufficiently accurate on its selected frames (input FPS), then the multi-object tracker can either maintain or improve the consistency of object detection, for example by interpolating and predicting on behalf of object detection during inconsistent detections.
Object detection is a highly parallel task, since it involves large, concurrent, stateless matrix operations. Therefore, GPUs shine and generally offer higher throughput than CPUs for most detection architectures. However, the gap is getting smaller. In Edge-AI scenarios, with Edge-focused CPUs that have higher core counts and integrated GPUs, the performance gap narrows greatly when running state-of-the-art, generalizable object-detection models, in comparison to a similar class of Edge-focused GPUs.
The multi-object tracker block assigns a unique ID to each object detected in a scene, then tracks the objects as they move throughout a video stream, including predicting each object’s new location in subsequent frames based on various attributes of the frame and the object.
A multi-object tracking block cannot be made entirely parallel, since it is highly stateful and relies on the last ’N’ frames of data to predict the current frame’s object positions. Most multi-object trackers are composed of multiple sub-algorithms, each independently stateful (which inherently makes the block more complex). Even though the sub-algorithm parts can run independently in parallel, most of the multi-object tracking block is comprised of sequential parts. These parts require that computations from all previous parallel sub-algorithm parts complete before they can commence processing; therefore, a bottleneck exists. These kinds of parallel-to-sequential stateful tasks generally perform better on a CPU than on a GPU of a similar class.
There are a variety of methods for measuring a multi-object tracker’s accuracy, including:
- Mostly Tracked: Measure the time an object retains its originally-assigned ID throughout its lifetime inside the field of view.
- ID Switches: Measure how many times an object’s assigned ID switches during its lifetime inside the field of view. The better the algorithm, the fewer ID changes the object will have and hence the better the algorithm is capable of tracking.
How well a multi-object tracker retains its object ID in the following challenging scenarios decides how well the algorithm is likely to perform in real world settings:
- Partial or full Occlusion of an object in a scene
- Short disappearance of object and reappearance in the same scene
- Density of objects that cross each other’s trajectories
- Deformation of objects in a scene
- Nonlinear motion of objects in a scene
9.1.4 Decision 1: Determining FPS for Object Detection
What FPS should object detection be running at?
The input video sources used in our benchmarking were live streams at 1080p resolution and 25 FPS. Objects such as people (with the exception of fast-moving objects such as cars or bikes) do not move significantly in spatial coordinates within a second. Therefore, it would be redundant and wasteful to perform object detection inferencing on every input frame (i.e., all 25 frames every second). The object-detection model used in our use case is a state-of-the-art model that generalizes well across diverse deployment environments. By its nature, this type of model is more computationally intensive, so inferencing only a subset of the incoming frames is important for extracting the best performance. This selective-processing technique also allows lower-speed Edge CPUs to handle a larger number of cameras.
The decision on how many frames per second to process for object detection inferencing should be based on system performance factors, such as:
- How performant is your multi-object tracker, especially in the context of lower versus higher FPS object detection.
- Does higher-FPS object detection necessarily translate into better object tracking accuracy? If so, up to what FPS limit? What is the cutoff beyond which increasing object detection FPS no longer meaningfully impacts multi-object tracking?
- How many objects are anticipated to be present in a scene at any given time?
- Similarity of object visual features, overlapping positions and occlusions of objects.
- Speed of movement of objects in a scene.
- Understanding the type and strength of occlusions of the objects and the necessity to retain the originally-assigned unique ID of each object being tracked, despite occlusions.
The typical sweet spot for object detection FPS is around 2–5 FPS, for most real-world non-fast-moving objects (e.g., persons), and around 5–15 FPS for fast-moving objects (e.g., vehicles). The lower end of the FPS range is advised only when the subsequent building block (i.e., the multi-object tracker) can reliably maintain sufficient accuracy across all scenarios, such as those mentioned in the section “Object Detection and Multi-object Tracking” above. For other scenarios, settling midway might provide the best overall accuracy.
In our benchmarking, experiments showed that the accuracy of the multi-object tracker in people counting saturates above 2 FPS of object detection in many scenes in a standard retail setting, thanks to its high-performance design. No reasonable real-world tracker accuracy gain was noted above 5 FPS, even when processing a dense crowd. Therefore, we suggest 2 FPS of object detection throughput per camera for people-counting use cases, and something higher than 2 FPS whenever the use case counts faster-moving objects such as vehicles.
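A minimal Python sketch of the frame-selection logic that this decision implies (15 FPS input and 2 FPS detection are the values used in this benchmark; the helper itself is an illustrative assumption, not our actual scheduler):

def detection_frame_selector(stream_fps=15, detection_fps=2):
    # Run the detector only on every Nth frame; the tracker still consumes every
    # frame and interpolates object positions in between detector runs.
    stride = max(1, round(stream_fps / detection_fps))
    return lambda frame_index: frame_index % stride == 0

# Example: at 15 FPS input and 2 FPS detection, the stride is 8, i.e. roughly
# every 8th frame (about 2 per second) is sent to the object detector.
should_detect = detection_frame_selector()
frames_to_detect = [i for i in range(15) if should_detect(i)]   # [0, 8]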
9.1.5 Decision 2: Determining FPS for Multi-object Tracking
What FPS should multi-object tracking be running at?
A multi-object tracking algorithm predicts the location of objects in the current frame based on the positions of those same objects across the past ’N’ contiguous frames. The more contiguous frames processed, the better the results should be. Our multi-object tracker performs better when ’N’ is 10–15 contiguous frames. Our prior experience with multiple academic and industry peers also suggests the sweet spot is in the range of 10 frames, to begin with. In our case, we chose ’N’ as 15 because we wanted the best performance even in situations where occlusion, object density, and motion within a scene are high.
9.1.6 Decision 3: Determining the Input Stream FPS
What FPS should the input video stream be ingested at?
From the two decisions above, it is evident that our maximum FPS requirement for a single camera stream is 15 FPS. In this situation, it makes sense to request a 15 FPS input video stream rather than a 25 FPS stream. The higher the FPS of the input video stream, the more computational cost is involved in transcoding and transrating. By limiting the input video stream to 15 FPS, we save (25 - 15) / 25 = 40% on our decoding computation costs.