
Why does Tesla stick to the pure vision route?

2022-02-02 20:51:48 TechWeb

【TechWeb】 Recently, Tesla China shared with the media, in an offline session, its thinking on the pure vision approach and the progress of its research.

Sticking with visual perception: using AI neural network technology to improve assisted driving

As shown in Figure 1, Andrej said: "We hope to build a neural network similar to the animal visual cortex and simulate how the brain processes information from input to output. Just as light enters the retina, we hope to simulate that process with cameras."

Figure 1: Schematic of the camera simulating the human image-processing process

HydraNets, a multi-task learning neural network architecture, feeds the raw data from eight cameras through one backbone network, processing it uniformly with a RegNet residual network and the BiFPN algorithm. This yields image features of different kinds and resolutions, which are then supplied to neural-network tasks with different requirements.

Figure 2: HydraNets, the multi-task learning neural network architecture
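The shared-backbone idea can be sketched in a few lines. Everything below is a toy illustration in numpy: the feature size, the three task heads, and the random weights are stand-ins invented for the example, not Tesla's actual RegNet/BiFPN stack.

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone(images):
    """Shared trunk: pool each camera image into a feature vector
    (a stand-in for the RegNet + BiFPN feature extractor)."""
    # images: (num_cameras, H, W) -> (num_cameras, feat_dim)
    flat = images.reshape(images.shape[0], -1)
    return np.tanh(flat @ rng.standard_normal((flat.shape[1], 32)))

def make_head(out_dim):
    """Each task head reads the shared features independently."""
    w = rng.standard_normal((32, out_dim))
    return lambda feats: feats @ w

# Hypothetical task heads, one per output type
heads = {
    "object_detection": make_head(4),   # e.g. box regression
    "lane_prediction": make_head(2),    # e.g. lane offsets
    "traffic_lights": make_head(3),     # e.g. light-state logits
}

cameras = rng.standard_normal((8, 16, 16))   # 8 cameras, toy 16x16 frames
shared = backbone(cameras)                   # computed once, reused by all heads
outputs = {name: head(shared) for name, head in heads.items()}
for name, out in outputs.items():
    print(name, out.shape)
```

The point of the structure is that the expensive trunk runs once per frame, while each lightweight head consumes the same shared features.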

However, because this structure handles single frames from a single camera, it runs into many bottlenecks in practice. Tesla therefore added a Transformer neural network to the substructure, so that the 2D image features originally extracted per camera are fused into features in a 3D vector space combining all the cameras, greatly improving recognition rate and accuracy.
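The fusion step can be illustrated with plain cross-attention: queries standing for cells of the 3D (bird's-eye-view) vector space attend over image features gathered from all eight cameras. This is a minimal numpy sketch with made-up dimensions and random weights, not the actual Transformer Tesla uses.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

num_cameras, tokens_per_cam, d = 8, 16, 32
# 2D image features from every camera, flattened into one token set
cam_feats = rng.standard_normal((num_cameras * tokens_per_cam, d))

num_queries = 64  # one query per cell of a toy bird's-eye-view grid
queries = rng.standard_normal((num_queries, d))

Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
# Each 3D-space query attends over all cameras' tokens at once
attn = softmax((queries @ Wq) @ (cam_feats @ Wk).T / np.sqrt(d))
bev = attn @ (cam_feats @ Wv)   # (num_queries, d): fused vector-space features
print(bev.shape)
```

Because every query can draw on every camera, objects spanning camera boundaries are represented once in the fused space instead of as fragments per image.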

That is still not the end. Because the result remains a single frame, time and space dimensions are also needed so that the vehicle has a kind of "memory" for handling occlusion, road signs, and similar situations. The final implementation therefore works on the video stream: features of the driving environment are extracted and assembled into a vector space, letting the vehicle judge its surroundings accurately and with low latency. This forms a 4D vector space, and the database of these video-derived features is used to train the autonomous-driving system.

Figure 3: Neural network architecture of the 4D vector space built from video
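The "memory" idea amounts to caching per-frame features together with the ego pose, so a later network can look back in time even when an object is momentarily occluded. A minimal sketch, with an assumed queue length:

```python
from collections import deque
import numpy as np

class FeatureQueue:
    """Keeps the last N frames of vector-space features so the network
    can 'remember' occluded objects and road signs already passed."""
    def __init__(self, maxlen=12):
        self.frames = deque(maxlen=maxlen)

    def push(self, feats, ego_pose):
        # store features with where the car was (the space dimension)
        self.frames.append((feats, ego_pose))

    def stacked(self):
        # time-stacked array for a video network to consume
        return np.stack([f for f, _ in self.frames])

q = FeatureQueue(maxlen=4)
for t in range(6):
    q.push(np.full((8,), float(t)), ego_pose=(t * 1.5, 0.0))
print(q.stacked().shape)   # only the most recent 4 frames are kept
```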

However, because urban autonomous driving differs from highway autonomous driving, the vehicle's planning module faces two major problems. First, a driving scenario does not necessarily have a single optimal solution; there are many local optima, meaning that for the same driving environment the autopilot can choose among many possible plans, all of them good. Second, the dimensionality is high: the vehicle must not only react now but also plan over the next stretch of time, estimating position, speed, acceleration, and so on.

Tesla therefore chose two methods to solve the planning module's two problems. One is discrete search to find the local-optimum "answers", executing roughly 2,500 such searches every 1.5 milliseconds; the other is continuous function optimization to handle the high-dimensional part. A global optimum is first found by discrete search, then continuous function optimization balances the demands of multiple dimensions, such as comfort and ride smoothness, to produce the final planned path.
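The two-stage scheme, a cheap discrete sweep over many candidate plans followed by continuous refinement of the winner, can be demonstrated on a one-dimensional toy cost with several local minima. The cost function, grid size, and step sizes here are invented for illustration:

```python
import numpy as np

def cost(x):
    """Toy planning cost with several local minima (candidate plans)."""
    return np.sin(5 * x) + 0.5 * (x - 1.0) ** 2

# Stage 1: discrete search - evaluate many candidate plans cheaply
candidates = np.linspace(-3, 3, 2500)        # cf. ~2,500 searches per 1.5 ms
best = candidates[np.argmin(cost(candidates))]

# Stage 2: continuous optimization - refine the winner by gradient descent;
# in a real planner, smooth terms (comfort, jerk, ...) join the cost here
x = best
for _ in range(200):
    grad = (cost(x + 1e-5) - cost(x - 1e-5)) / 2e-5  # numerical gradient
    x -= 0.01 * grad

print(x, cost(x))
```

The discrete stage cannot get trapped in a bad local minimum because it looks everywhere; the continuous stage then polishes that coarse answer to high precision.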

Besides planning for itself, the system must also "estimate" and anticipate the plans of other objects. In the same way, based on its recognition of other objects and basic parameters such as their speed and acceleration, it plans routes for the other vehicles as well and responds accordingly.

But road conditions around the world are ever-changing and very complicated. Pure discrete search would consume enormous resources and make decisions take too long, so Tesla chose a deep neural network combined with Monte Carlo tree search, which improves decision efficiency greatly, by almost an order of magnitude.

Figure 5: Efficiency of the different approaches
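The order-of-magnitude claim is easy to see in node counts: if a learned policy prunes each decision point from b candidate actions down to the k most promising ones, the search tree shrinks geometrically. The numbers below are arbitrary toy values, not Tesla's:

```python
b, d = 8, 6            # branching factor and planning depth (assumed)
exhaustive = sum(b**i for i in range(1, d + 1))   # nodes without guidance

k = 2                  # a network keeps only the k most promising actions
guided = sum(k**i for i in range(1, d + 1))       # nodes with guidance

print(exhaustive, guided, exhaustive / guided)
```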

The overall architecture of the final planning module is shown in the figure below. The data is first processed into a 4D vector space by the pure-vision architecture; then, based on the previously obtained object recognition and shared features, a deep neural network searches for the global optimum, and the final planning result is handed to the actuators for execution.

Figure 6: Overall architecture of visual recognition plus planning and execution

Of course, even the best neural network architectures and processing methods depend on an effective and huge database. As the data moved from 2D to 3D and 4D, Tesla's manual annotation team of roughly 1,000 people moved with it, labeling directly in the 4D space; a label made once in vector space is automatically mapped into the individual images of the different cameras, greatly multiplying the amount of annotation. But even that is not enough: manually labeled data falls far short of the volume needed to train autonomous driving.

Figure 7: Demonstration of manual annotation in the 4D vector space

Because people are better at semantic recognition while computers are better at geometry, triangulation, tracking, reconstruction, and the like, Tesla wanted to create a mode of joint annotation in which humans and computers divide the labor harmoniously.

Tesla built a huge automatic labeling pipeline: video clips of about 45 seconds to one minute, together with a large amount of sensor data, are given to neural networks for offline learning, and then a large fleet of machines and AI algorithms generates annotated datasets that can be used to train the networks.

Figure 8: Processing flow for automatic annotation of video clips

For recognizing drivable areas such as roads, lane lines, and intersections, Tesla used NeRF (Neural Radiance Fields), an image-processing algorithm that converts 2D into 3D. Given (x, y) coordinate points, the neural network predicts the height of the ground there, generating countless (x, y, z) coordinates along with various semantics such as road edge, lane line, and road surface. These form a large number of information points, which are projected back into the camera images; the road data is then compared with the image-segmentation results recognized by the neural network, the images from all cameras are optimized jointly, and, combined with the time and space dimensions, a well-reconstructed scene is created.

Figure 9: Demonstration of road reconstruction
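The pipeline just described, query a height for each (x, y), form 3D points, and project them back into a camera, can be sketched with a toy height function and an ideal pinhole camera. The height function, focal length, and principal point are all assumed values for illustration:

```python
import numpy as np

def ground_height(x, y):
    """Stand-in for the network that predicts road height at (x, y)."""
    return 0.02 * x + 0.05 * np.sin(y)      # gently sloping toy road

# Sample a grid of (x, y) query points in front of the car
xs, ys = np.meshgrid(np.linspace(2, 20, 10), np.linspace(-5, 5, 10))
pts = np.stack([xs.ravel(), ys.ravel(), ground_height(xs, ys).ravel()], axis=1)

def project(points, f=500.0, cx=640.0, cy=360.0):
    """Pinhole projection of 3D points (x forward, y left, z up) to pixels."""
    u = cx - f * points[:, 1] / points[:, 0]
    v = cy - f * points[:, 2] / points[:, 0]
    return np.stack([u, v], axis=1)

pixels = project(pts)
print(pts.shape, pixels.shape)   # reconstructed 3D road points and their pixels
```

Comparing these projected points against per-camera segmentation, as the text describes, is what lets the reconstruction be corrected jointly across all cameras.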

With this technique, the road information reconstructed by different vehicles passing the same place can be cross-compared: for the prediction to be correct, they must agree at every location. Under this joint constraint, an effective method of labeling the road surface takes shape.

Figure 10: Labels from multiple videos overlapped and cross-checked against each other
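The cross-checking can be mimicked with synthetic data: several noisy reconstructions of the same road profile are compared, and only points where all trips agree are kept as labels. The noise level, the corrupted stretch, and the agreement threshold are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(2)

true_height = np.linspace(0.0, 0.5, 50)        # ground-truth road profile
# Three different trips reconstruct the same road, each with its own noise
trips = [true_height + rng.normal(0, 0.02, 50) for _ in range(3)]
# One trip has a bad stretch (e.g. the view was blocked by a truck)
trips[1][20:25] += 0.5

stack = np.vstack(trips)
spread = stack.max(axis=0) - stack.min(axis=0)
consistent = spread < 0.1                      # trips must agree everywhere
fused = np.where(consistent, stack.mean(axis=0), np.nan)  # keep agreed labels

print(int(consistent.sum()), "of", len(true_height), "points cross-validated")
```

A single bad reconstruction cannot poison the label set, because disagreement between trips flags the affected points instead of averaging them in.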

This is entirely different from a high-precision map. As long as the annotation generated from all the video clips grows ever more accurate and stays consistent with the actual road conditions in the video, there is no map data that has to be maintained.

Using these techniques, static objects can also be recognized and reconstructed, and 3D information points can be marked on them whether they are textured or textureless; these marked points are very useful for the cameras in recognizing arbitrary obstacles.

Figure 11: 3D information-point reconstruction of static objects

Another benefit of processing this data and its annotation offline is that the network running in the car can only predict the motion of other objects moment by moment, whereas offline the data is fixed, so both the past and the future are known. This makes it possible, regardless of occlusion, to predict, calibrate, and label the speed and acceleration of every object; networks trained on such labels judge the motion of others far more accurately, which in turn helps the planning module plan.

Figure 12: Offline calibration and labeling of vehicle and pedestrian speed and acceleration
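The advantage of knowing the future is concrete: offline, the velocity at frame t can be estimated from frames t−1 and t+1 (central differences), which an online system cannot do because frame t+1 has not happened yet. A small sketch with an assumed frame interval:

```python
import numpy as np

dt = 0.1                                   # camera frame interval (assumed)
t = np.arange(0, 5, dt)
pos = 2.0 * t + 0.5 * 1.5 * t**2           # object moving at a = 1.5 m/s^2

# Online, only past frames exist; offline the clip is fixed, so every
# frame can use both its past AND its future to estimate motion.
vel = np.gradient(pos, dt)                 # central differences use t-1, t+1
acc = np.gradient(vel, dt)

print(vel[25], acc[25])                    # motion at t = 2.5 s
```

For smooth motion the two-sided estimate is exact here, while a purely causal (backward) difference would lag the true velocity by half a frame.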

Combining all of this produces, for the video data, recognition, anticipation, and reconstruction of everything related to the road and of static and dynamic objects, with the dynamic data labeled.


Figure 13: Reconstruction and annotation of the surrounding environment from video clips

Video data annotated this way became the core of training the autonomous-driving neural network. In one project, after three months of training the network with such data, it reproduced all the functions of the millimeter-wave radar with even better accuracy, so the radar was removed.

Figure 14: Even when the camera can barely see, judgments of speed and distance remain accurate

Having verified that this method is highly effective, the next need was massive video data for training. In parallel, Tesla therefore developed a "simulation scene" technology that can simulate the rarer "edge scenarios" for autonomous-driving training. As the figure below shows, in a simulated scene Tesla engineers can supply different environments and other parameters (obstacles, collisions, comfort, and so on), which greatly improves training efficiency.

Figure 15: Simulation scenario

Tesla trains the network in simulation mode and has already used 300 million images and 5 billion labels to train it; next it will continue using this approach to solve more problems.

Figure 16: Improvements expected from the simulation mode in the coming months

To sum up, improving the capability of the autonomous-driving network faster requires processing a huge number of video clips and operations. One simple example: just to get rid of the millimeter-wave radar, 2.5 million video clips had to be processed, generating more than 10 billion labels. All of this increasingly makes hardware the bottleneck on development speed.
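Those two figures imply the scale of the automation, since on average each clip contributed thousands of labels, far beyond what manual annotation could deliver:

```python
clips = 2_500_000          # video clips processed for radar removal
labels = 10_000_000_000    # labels generated from them
print(labels // clips)     # average labels per clip
```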

Previously, Tesla's training hardware was a set of about 3,000 GPUs and slightly fewer than 20,000 CPUs, with more than 2,000 additional FSD computers for simulation; this later grew to a 10,000-GPU machine, the world's fifth-largest supercomputer. Even so, it is not enough.

Figure 17: Parameters and evolution of the supercomputers currently in use

So Tesla decided to develop its own supercomputer.

"A pioneering work of engineering": the D1 chip and the Dojo supercomputer

At present, as the data to be processed grows exponentially, Tesla is also raising the computing power used to train its neural networks; hence the Tesla Dojo supercomputer.

Tesla's goal is ultra-high computing power for AI training that can handle large, complex neural network models while expanding bandwidth, reducing latency, and saving cost. This requires the layout of the Dojo supercomputer to strike the best balance between space and time.

As shown in the figure, the key unit of the Dojo supercomputer is D1, a neural-network training chip Tesla developed in-house. The D1 chip uses a distributed architecture and a 7-nanometer process, carrying 50 billion transistors and 354 training nodes, with internal wiring fully 17.7 km long; it achieves enormous computing power and ultra-high bandwidth.

Figure 18: D1 chip technical parameters

Figure 19: The D1 chip shown live

As shown in the figure, a single training tile of the Dojo supercomputer consists of 25 D1 chips. Because the D1 chips are seamlessly connected, latency between adjacent chips is very low, and the tile preserves bandwidth to the greatest extent with Tesla's own high-bandwidth, low-latency connectors. In less than one cubic foot it reaches 9 PFLOPS of compute, with I/O bandwidth of up to 36 TB/s.

Figure 20: A training tile composed of D1 chips

Figure 21: The training tile shown live

Thanks to the training tile's ability to run independently and link without limit, the performance of a Dojo supercomputer built from tiles can in theory scale without bound; it is a true "performance beast". As the figure below shows, in practical deployment Tesla assembles 120 training tiles into the ExaPOD, a world-leading AI training computer. Compared with other products in the industry, it delivers 4 times the performance at the same cost, 1.3 times the performance at the same energy consumption, and occupies one fifth of the space.

Figure 22: Training tiles combined into the ExaPOD
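Taking the per-tile figure quoted above at face value, the aggregate compute of an ExaPOD follows directly, which is where the "Exa" in the name comes from:

```python
tiles = 120            # training tiles per ExaPOD
pflops_per_tile = 9    # quoted compute of one training tile
total_pflops = tiles * pflops_per_tile
print(total_pflops, "PFLOPS =", total_pflops / 1000, "EFLOPS")
```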

Matching the powerful hardware is DPU (Dojo Processing Unit), a distributed system developed by Tesla. The DPU is a visual, interactive piece of software whose scale can be adjusted on demand at any time; it efficiently handles processing and computation, data modeling, storage allocation, layout optimization, partition expansion, and other tasks.

Soon, Tesla will begin the first assembly of the Dojo supercomputer, with further improvements planned across the whole machine, from the chips to the system. For AI technology, Musk clearly has greater ambitions. That ambition was there in his half-joking opening line, "We have a technical problem, and I hope we can solve it with AI," and even more in the promise he made at the end of the event: "We will further explore the whole human world."
