In this post we’ll introduce KineticFlow™, Ghost’s computer vision neural network for obstacle and road environment perception. We’ll summarize how KineticFlow works and what makes it distinct; follow-on posts will dive deeper into each attribute and show it in action.
In case you missed it, check out the first post in our computer vision blog series, which gives some background on typical mono-depth approaches and discusses some of the challenges in adapting these legacy ADAS/L2 vision approaches to true attention-free (L4) self-driving.
Introducing KineticFlow – the universal vision neural network
KineticFlow is the main visual perception network of the Ghost Autonomy Engine that is used to identify actors and features of the road, and calculate their distance, velocity, and direction of motion. The network uses a pair of cameras as inputs, analyzing video sequences with a combination of both mono and stereo computer vision techniques to produce per-pixel measurements. KineticFlow uses a physics-based approach to AI, enabling it to identify objects universally without specific training. This physics-based approach not only delivers universal recognition, but it enables scalable and fully-automated data center training, and high-efficiency in-car execution on standard low-power SoCs.
As a single vision-based neural network, KineticFlow outputs the four main primitives required for driving: it detects relevant objects (without explicit classification or identification) and calculates distance, velocity, and motion direction for each. While these measures are most interesting for the relevant/actionable objects in the scene, KineticFlow is capable of supplying these metrics per-pixel. Each supplied metric is also assigned a confidence factor, which helps the downstream driving program interpret KineticFlow’s outputs in concert with outputs from other sensor modalities, perform sensor fusion appropriately, and ultimately make driving decisions.
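To illustrate how a confidence factor can be used downstream, here is a minimal Python sketch of confidence-weighted fusion. It is not Ghost's fusion logic: the `Measurement` type, the `fuse` function, the 0-to-1 confidence scale, and the sample values are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Measurement:
    value: float       # e.g. distance to an object, in meters
    confidence: float  # assumed scale: 0.0 (no trust) to 1.0 (full trust)


def fuse(measurements: List[Measurement]) -> Optional[float]:
    """Confidence-weighted average of several estimates of the same quantity."""
    total = sum(m.confidence for m in measurements)
    if total == 0:
        return None  # nothing trustworthy to fuse
    return sum(m.value * m.confidence for m in measurements) / total


# Hypothetical readings for one object from two sensor modalities.
vision = Measurement(value=50.0, confidence=0.6)
radar = Measurement(value=52.0, confidence=0.9)
fused = fuse([vision, radar])
```

With these sample numbers the fused distance lands at about 51.2 m, closer to the higher-confidence radar reading. A production system would more likely use a probabilistic filter, but the weighting idea is the same.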
How KineticFlow works: physics-based vs. image-based
Most traditional approaches to AI for self-driving involve image recognition – explicitly training a neural network with millions of images of road actors (e.g., cars, trucks, buses, humans), in every color, rotation, and weather/lighting condition. This is laborious and expensive, but the real challenge is that it’s impossible to be exhaustive – literally anything can be on the road and could cause an accident someday if it’s not recognized properly. See the first post in this series to understand the dangers of mis-recognition further.
KineticFlow takes a fundamentally different, physics-based approach, and sees the world as a series of surfaces or planes. Every actor and element of the scene can be broken down into planes, each of which has universal properties of light and motion governed by the laws of physics.
KineticFlow executes a series of analysis steps as part of the neural network. First, the scene is divided into a series of discovered planes. The planes that comprise the road are then identified, helping to disambiguate things that are actually part of the road (say, paint on the road for markers), things that rest on top of the road (vehicles, obstacles, things that need to be reasoned about), and things that are well above the road (bridges, signs, things that can usually be driven under and otherwise ignored). The ground plane and road marker identification are later used for scene understanding by our Scene Canonicalization Network, which will be the subject of a future post.
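The three-way split described above (part of the road, resting on the road, well above the road) can be sketched as a height test against the road plane. This is a simplified illustration, not how the network does it internally: the function names, the 5 cm flatness tolerance, and the 4.5 m clearance threshold are hypothetical, and the road is assumed to be a single known plane.

```python
import numpy as np


def height_above_road(points, normal, offset):
    """Signed distance of each 3D point from the plane normal . p + offset = 0."""
    normal = np.asarray(normal, dtype=float)
    points = np.asarray(points, dtype=float)
    return (points @ normal + offset) / np.linalg.norm(normal)


def classify_surface(points, normal, offset, flat_tol=0.05, clearance=4.5):
    """Label a surface by where its points sit relative to the road plane.

    flat_tol and clearance are illustrative thresholds in meters.
    """
    heights = height_above_road(points, normal, offset)
    if heights.max() <= flat_tol:
        return "road"        # flush with the road, e.g. paint
    if heights.min() >= clearance:
        return "overhead"    # entirely above the clearance height
    return "obstacle"        # occupies the space a vehicle drives through


# Road plane y = 0 with the y axis pointing up (an assumed convention).
road_normal, road_offset = (0.0, 1.0, 0.0), 0.0

paint = [(1.0, 0.00, 20.0), (1.2, 0.01, 25.0)]
car = [(0.0, 0.20, 30.0), (0.5, 1.50, 30.0)]
bridge = [(-5.0, 5.0, 80.0), (5.0, 6.0, 80.0)]

labels = [classify_surface(s, road_normal, road_offset) for s in (paint, car, bridge)]
```

Here `labels` comes out as road, obstacle, overhead for the paint, car, and bridge surfaces respectively.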
Once planes that are likely part of objects/obstacles are detected (but not identified or recognized, in the classic “image recognition” sense), KineticFlow observes all the planes in the scene over time, and reasons that planes that are consistently moving together are part of the same object. This physics-based approach enables KineticFlow to observe all the elements necessary for later scene understanding as well as detect all the key actors in the scene. While Ghost may choose to use additional neural networks to derive higher-level understanding (recognizing people, VRUs, signs, or tail lights, for example), no explicit recognition is required for the fundamentals of driving – understanding that there are objects on the road and how they are moving, so that Ghost can avoid them. This universal approach to recognition is Ghost’s baseline for delivering safe perception, and it elegantly eliminates the need to solve the “long tail” training problem that plagues traditional image recognition approaches.
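The idea that planes moving together belong to the same object can be illustrated with a small grouping routine. This is a hedged sketch rather than KineticFlow's method: it assumes each plane already has a per-frame velocity estimate, and `group_by_motion` and the 0.5 m/s tolerance are hypothetical.

```python
import numpy as np


def group_by_motion(tracks, tol=0.5):
    """Group planes whose velocities stay within tol (m/s) in every frame.

    tracks: array-like of shape (n_planes, n_frames, 3), one velocity vector
    per plane per frame. Returns one integer group label per plane.
    """
    tracks = np.asarray(tracks, dtype=float)
    n = len(tracks)
    parent = list(range(n))  # union-find forest over plane indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Largest velocity difference between the two planes over all frames.
            worst = np.linalg.norm(tracks[i] - tracks[j], axis=1).max()
            if worst <= tol:
                parent[find(j)] = find(i)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Four planes over two frames: two move forward together, two move sideways together.
tracks = [
    [[10.0, 0.0, 0.0], [10.2, 0.0, 0.0]],
    [[10.1, 0.0, 0.0], [10.3, 0.0, 0.0]],
    [[0.0, 0.0, -5.0], [0.0, 0.0, -5.1]],
    [[0.0, 0.0, -5.0], [0.0, 0.0, -5.1]],
]
groups = group_by_motion(tracks)
```

The first two planes end up in one group and the last two in another, without any notion of what kind of object each group is.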
Multiple complementary algorithms – one neural network
Although KineticFlow operates on the raw outputs of at least two cameras, it’s not strictly correct to call it a “stereo” network. KineticFlow fuses multiple mono and stereo computer vision algorithms in one network, producing better self-reinforcing results than either a mono or stereo approach could deliver independently. KineticFlow uses stereo disparity from cameras to calculate depth and ultimately distance, and mono 3D optical flow algorithms to calculate velocity and motion using pixel expansion. Multiple disparity-based distance measurements can also be analyzed by the driving program over a short period of time to independently calculate velocity and motion.
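To make the geometry concrete, here is a minimal Python sketch of the textbook relations behind these measurements: depth from stereo disparity, closing speed from two distance measurements, and time to contact from pixel expansion. It is not KineticFlow's implementation, and the function names, focal length, baseline, and pixel values are hypothetical.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole stereo relation: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def closing_speed(dist_prev_m, dist_curr_m, dt_s):
    """Speed at which the gap shrinks between two distance measurements."""
    return (dist_prev_m - dist_curr_m) / dt_s


def time_to_contact(width_prev_px, width_curr_px, dt_s):
    """Time to contact from image expansion, assuming constant closing speed.

    If an object's image grows by a factor s over dt seconds, then
    TTC = dt / (s - 1), measured from the second frame.
    """
    scale = width_curr_px / width_prev_px
    if scale <= 1.0:
        return float("inf")  # not expanding, so not closing
    return dt_s / (scale - 1.0)


# Hypothetical rig: 2000 px focal length, 0.5 m baseline, frames 0.5 s apart.
z_prev = depth_from_disparity(20.0, focal_px=2000.0, baseline_m=0.5)  # 50 m
z_curr = depth_from_disparity(25.0, focal_px=2000.0, baseline_m=0.5)  # 40 m
speed = closing_speed(z_prev, z_curr, dt_s=0.5)                       # 20 m/s
ttc = time_to_contact(width_prev_px=40.0, width_curr_px=50.0, dt_s=0.5)  # 2 s
```

The two paths agree: the stereo path gives 40 m at 20 m/s, which is 2 seconds, and the mono expansion path gives 2 seconds directly without knowing the distance. That kind of cross-check is one way to read "self-reinforcing" in the paragraph above.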
It’s important to realize that the above algorithms aren’t run exhaustively in real time in the car. Instead, they are run on training data in the data center, often using many minutes of wall clock time per video frame on powerful processors. These algorithms are used to richly label training data, which is then used to train a neural network that runs in the car in real time. And because these algorithms are run with painstaking precision in the data center, they can learn from and reinforce one another’s results, and the training process can take advantage of forward and backward time (i.e., a prediction made at a distance in one frame can be verified several frames later to see how accurate it was).
In one sentence: KineticFlow is a universal neural network, trained in the data center using highly accurate, compute-intensive mono and stereo computer vision algorithms and made more accurate by combining multiple algorithms and analyzing video sequences over time, that outputs object detections, distance, and motion in a matter of milliseconds on low-power SoC compute.
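The "forward and backward time" idea can be sketched for the simplest case. The example below assumes a static object approached in a straight line and a known ego travel distance between frames; `label_with_hindsight` and all the numbers are hypothetical, and real offline labeling would have to handle moving objects and much more.

```python
def label_with_hindsight(early_estimates_m, ego_travel_m, later_distance_m):
    """Check early long-range estimates against a later close-range measurement.

    early_estimates_m: distance estimates made in earlier frames.
    ego_travel_m: distance the vehicle drove from each earlier frame to the
        later frame (assumes a static object and straight-line approach).
    later_distance_m: the later, closer, and presumably more accurate measurement.

    Returns a list of (hindsight_label_m, signed_error_m) pairs.
    """
    results = []
    for estimate, travel in zip(early_estimates_m, ego_travel_m):
        hindsight = later_distance_m + travel  # where the object must have been
        results.append((hindsight, estimate - hindsight))
    return results


# Two early estimates, later checked against a 20 m close-range measurement.
checked = label_with_hindsight(
    early_estimates_m=[118.0, 92.0],
    ego_travel_m=[100.0, 70.0],
    later_distance_m=20.0,
)
```

Here the hindsight labels are 120 m and 90 m, so the early estimates were off by -2 m and +2 m. Labels like these are only available offline, since the car cannot look into the future while driving.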
Optimized for scale and resiliency – HD cameras and standard SoCs
KineticFlow was designed around an opinionated hardware stack, optimizing the neural network from the start for the latest-generation sensors and compute to deliver an ideal balance of performance, cost, and power efficiency.
Other systems often use two or three mono cameras, each focused on a different distance and field of view. This not only adds cost, but also makes resiliency a challenge, as there is only one camera per distance. KineticFlow leverages high-definition cameras (12MP in our first-generation hardware, 48MP in our second-generation hardware), enabling a single camera pair to see both wide at near distance and hundreds of meters down the road, delivering the perception necessary for high-speed freeway driving. Although KineticFlow primarily functions with stereo cameras, it is still capable of producing useful outputs from a single mono camera. Since Ghost systems feature camera + radar, this means that in many failure or occlusion scenarios KineticFlow can still provide enough information to continue driving or execute a safe stop as part of a minimum risk condition.
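A back-of-the-envelope calculation shows why sensor resolution matters for long-range stereo. The sketch below uses standard pinhole geometry. Every number in it is an assumption for illustration and not a Ghost specification: a 60 degree horizontal field of view, a 1.0 m baseline, and image widths of 4000 px and 8000 px standing in for 12MP and 48MP sensors.

```python
import math


def focal_length_px(image_width_px, hfov_deg):
    """Focal length in pixels for a pinhole camera with the given horizontal FOV."""
    return (image_width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)


def disparity_px(distance_m, focal_px, baseline_m):
    """Stereo disparity of a point at the given distance: d = f * B / Z."""
    return focal_px * baseline_m / distance_m


def depth_error_m(distance_m, focal_px, baseline_m, disparity_error_px=1.0):
    """First-order depth error for a given disparity error: dZ = Z^2 / (f * B) * dd."""
    return distance_m ** 2 / (focal_px * baseline_m) * disparity_error_px


HFOV_DEG = 60.0    # assumed field of view
BASELINE_M = 1.0   # assumed camera separation
RANGE_M = 200.0    # a freeway-relevant distance

results = {}
for name, width_px in (("12MP", 4000), ("48MP", 8000)):
    f = focal_length_px(width_px, HFOV_DEG)
    results[name] = (
        disparity_px(RANGE_M, f, BASELINE_M),
        depth_error_m(RANGE_M, f, BASELINE_M),
    )
```

At the same field of view, doubling the image width doubles the disparity available at 200 m and halves the depth error caused by a one-pixel disparity error, which is how one camera pair can cover both wide near-field and long-range perception.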
Traditional vision systems require specialty ASICs to process rich computer vision algorithms in real time, while some newer approaches bring hundreds of TOPS of GPU power into the car to run complex AI models. These designs suffer from rigidity (it’s difficult to change what has been baked into an ASIC), high cost, and poor energy efficiency. KineticFlow, in contrast, is optimized for low-power, low-cost SoCs with integrated CPU, GPU, and TPU compute. The same SoCs are used both for in-car execution and for training validation in the data center, ensuring that KineticFlow behaves the same on the road as it did when validated in the data center.
Now that we’ve explained the basics of KineticFlow, future blog posts in this series will cover more depth on how KineticFlow is trained, how its accuracy is validated, and how the physics-based approach delivers universal perception. In the meantime, feel free to download our Ghost Autonomy Engine technical brief to learn more.