On December 16, 2021, the fourth TVMCon was held online as scheduled. Since the field's starting point more than four years ago, deep learning compilation technology has grown from its early embryonic stage into something the whole industry takes part in, and we have seen many interesting discussions about compilation and optimization techniques. The vigorous growth of the deep learning compilation ecosystem has also given us many new ideas. As the first deep learning compilation framework to connect the whole flow end to end, TVM can already provide a lot of value. But just as deep learning frameworks themselves went from a first generation (Caffe) to the next two generations (TF, PyTorch), we are well aware that deep learning compilation technology also needs to go through several generations of evolution. So two years ago we started asking this question: "Where is the bottleneck of current deep learning compilation technology, and what is the next-generation technology?"
Fortunately, as the earliest architecture to connect the whole compilation flow, we can directly observe lessons that only become visible when you try to integrate everything. The ecosystem that is now flourishing (the various MLIR dialects, XLA, ONNX, TorchScript) also gives us many points of reference to learn from. This article summarizes our reflections on the deep learning compilation and hardware acceleration ecosystem over the past two years, and lays out our technical outlook for the next generation of deep learning compilation technology. We hope it is a useful reference. Most of the content comes from our keynote at TVMCon.
What is the status quo: current deep learning compilation solutions and bottlenecks
Four kinds of abstraction
Deep learning compilers have moved from the embryonic stage to a stage of vigorous growth. Setting specific implementations aside, today's deep learning compilation ecosystem revolves around four types of abstraction:
Computational graph representation (computational graph): a computational graph expresses a deep learning program as a DAG, on which we perform high-level optimizations such as operator fusion, rewriting, and parallelization. Relay, XLA, Torch-MLIR, and ONNX are basically at this level.
Tensor program representation (tensor program): at this level we need to apply loop optimizations to subgraphs; supporting DSAs also requires optimizing tensorization and memory movement.
Operator libraries and runtime (library and runtime): operator libraries are still how we quickly bring in expert input to optimize performance. At the same time, the runtime provides quick support for data structures and runtime libraries.
Hardware-specific instructions (hardware primitive): dedicated hardware and programmable deep learning accelerators also create the need for dedicated hardware tensor instructions.
The deep learning compilation ecosystem basically revolves around implementations of these abstractions. The figure above is an incomplete summary of the current ecosystem. It is worth pointing out that operator libraries and runtimes are not directly related to compilation technology. But because our goal is to accelerate deployment and execution on new hardware, the runtime abstraction and hardware instructions are also an important part of the ecosystem.
That we need multiple layers of abstraction is basically a consensus in the field of deep learning compilation and optimization. But to really support machine learning, the individual components alone are not enough. The biggest problem we found is "how to design the abstractions at each level and integrate them effectively".
The current solution
If we look carefully at the whole current ecosystem, including deep learning frameworks and compilation frameworks (whether based on MLIR, ONNX, or TVM), everyone follows an approach called multi-stage lowering. The general idea is to adopt one (sometimes more than one) intermediate representation at each level of abstraction. At each level (dialect, abstraction) we do some internal optimization, then hand the problem to the next level to continue optimizing.
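The multi-stage lowering idea can be sketched in a few lines of plain Python. This is only a toy illustration: the stage names, the list-of-strings "IR", and the fusion rule are all hypothetical and are not APIs of TVM, MLIR, or ONNX. Each stage optimizes or converts its own representation and then hands the result to the next stage.

```python
# Toy sketch of multi-stage lowering. Every name here is hypothetical;
# none of this is an API of TVM, MLIR, or ONNX.

def optimize_graph(graph):
    """Graph level: fuse an "add" that directly follows a "matmul"."""
    fused = []
    for op in graph:
        if op == "add" and fused and fused[-1] == "matmul":
            fused[-1] = "matmul_add"
        else:
            fused.append(op)
    return fused

def lower_to_loops(graph):
    """Graph -> tensor program: each operator becomes a loop-nest description."""
    return [f"loops({op})" for op in graph]

def lower_to_target(loops):
    """Tensor program -> target: each loop nest becomes a kernel name."""
    return [f"kernel[{nest}]" for nest in loops]

def compile_pipeline(graph):
    # Information flows in one direction only: nothing learned in a later
    # stage can change a decision made in an earlier one.
    for stage in (optimize_graph, lower_to_loops, lower_to_target):
        graph = stage(graph)
    return graph

kernels = compile_pipeline(["matmul", "add", "relu"])
```

Note that the fusion decision is fixed in the first stage, before anything is known about the loops or the target. That one-way flow is what the rest of the article examines.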
For reasons of ecosystem design, among others, current solutions basically share the following characteristics: each level of abstraction is maintained by a relatively independent group; the layers are often loosely coupled; and the final solution is often presented to users as a black-box tool.
Many people's vision is that as long as each layer's optimization is done well, we can stitch the pieces together into a solution that meets our needs. This approach does deliver some results, but we also need to ask: is multi-stage lowering really enough to solve the deep learning optimization problem?
Two kinds of barriers
We completed a solution based on multi-stage lowering about three years ago. After building the full-stack solution and putting it into practice, we found two barriers holding back the whole industry.
The first, vertical barrier separates manually optimized solutions from automatically compiled ones. Looking at today's deep learning runtime frameworks and compilation frameworks, we find two schools. One is the operator-library-driven approach built on manual operator optimization. It makes it relatively easy for more optimization experts to contribute, but it also brings considerable engineering overhead. The other is the compilation approach built on automatic optimization. Compiler-centric solutions often deliver more automation, but it is sometimes hard to bring domain knowledge into them. Most current frameworks are designed for only one of the two. Yet we often find that a good solution needs both input from machine learning engineers and automation, and that the optimal solution changes as the field develops. How to break down this vertical wall, so that manual optimization, the knowledge of machine learning optimization experts, and automatic optimization are integrated organically, is a major problem the industry faces today.
Besides the vertical wall, the second kind of barrier is directly tied to the multi-stage lowering way of building an ecosystem. Because we tend to design the different levels of abstraction separately, we can optimize flexibly inside one abstraction, but moving from one abstraction to another usually requires a wholesale conversion through a translator or a lowering step. As a result, many difficulties concentrate at the boundary between one abstraction and the next. If we want to do step-by-step optimization at the boundary (for example, hand some of the subgraphs to one kind of compilation logic and leave the rest to another), we have to introduce a lot of engineering at the boundary. Another common phenomenon is that these conversions are one-way: we usually do some optimization on a high-level abstraction such as the computational graph, then pass the result down to the tensor computation level, while information from the tensor computation or hardware level is hard to feed back to the higher levels. For example, optimizing the tensor program can often, in turn, guide operator fusion and data layout at the computational graph level, but the current one-way architecture makes it hard to use this kind of feedback naturally.
To sum up our experience: deep learning compilation and optimization is not a problem where a single layer can handle all optimizations. Solving it requires the levels of abstraction to work in concert. With the emergence of infrastructure such as TVM and MLIR, it is already fairly easy to build an abstraction or dialect at a given level and to transform layer by layer from high-level to low-level abstractions through multi-stage lowering. The difficulty now tends to appear at the boundaries where abstractions are converted. Whether the goal is to introduce more modular, integrated conversions at the boundary or to attempt feedback iterations, multi-stage lowering by itself is not enough. Moreover, as the number of abstraction levels grows, adding a custom operator often requires further architectural work at every level, and the cost becomes even greater.
Because of the vertical and horizontal barriers between abstractions, no matter how perfect we make the inside of one layer, end-to-end overall optimization remains difficult. Note that these barriers and problems have nothing to do with the choice of infrastructure: whether a solution is based on MLIR, ONNX, or TVM, once it adopts multi-stage lowering it will inevitably face them. We believe that colleagues working in this field, whatever infrastructure they use, will run into this fundamental problem to some degree once their solution works end to end.
Where is the future: from arrow to circle
In this section we present our thinking on this problem, distilled over more than two years. It is also the core technical roadmap for our evolution toward a new generation of deep learning compilation system. We call this roadmap TVM Unity.
We can observe that almost all of the difficulties come from boundary barriers, so the key thing to solve is to eliminate them. Our goal is to evolve the one-way arrow of multi-stage lowering into a circle in which the abstractions interact organically.
The whole technical roadmap has three key points:
Unify: Unified multi-layer abstraction
Interact: Interactive open iteration
Automate: Automatic optimization and integration
Unify: Unified multi-layer abstraction
To break down the barriers between abstractions, the first step is to unify the abstractions. This does not mean designing a single level that solves every problem. With a slight misstep in the architecture, we could end up with one complex representation that combines the weaknesses of all the abstractions. We still need to recognize the importance of each kind of abstraction, but the different kinds must be co-designed, and each level must be able to interact with the others.
For TVM specifically, the focus is on four abstractions. AutoTensorization connects hardware instruction declarations with tensor programs. The TVM FFI (PackedFunc) mechanism lets us flexibly bring in arbitrary operator library and runtime functions, and lets compilation modules and user-defined modules call one another. TensorIR is responsible for integrating tensor-level programs with hardware tensor instructions. Relax (Relay Next) is a further iteration of Relay that directly introduces first-class symbolic shape support. But as mentioned at the start, what matters most is not each level of abstraction by itself but the interaction and joint optimization between abstractions.
The program example above shows the design goal of our unified abstraction. It is an IRModule written in TVMScript. MyIRModule contains two functions. tir_mm is a TensorIR-level function; the design goal of the new generation of TensorIR is to link tensor programs with hardware-specific instructions through automatic tensorization. Now look at relax_func. It has several key points. R.call_dps((n, m), tir_mm, [x, w]) is a direct call, at the computational graph level, to the TensorIR function tir_mm. Through this special call form, the tensor program still appears in the computational graph in the same way a graph operator does. Supporting this means the computational graph must be designed jointly with the tensor program representation, so that graph-level optimization can use tensor-program-level information. Finally, R.call_packed("custom_inplace_update", gv0) lets the computational graph interact directly with TVM FFI functions.
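To make the two call forms concrete, here is a toy model in plain Python. It is not the TVM or Relax API: the helper functions call_dps and call_packed, the PACKED_FUNCS registry, and the body of custom_inplace_update are all assumptions made for illustration, and nested lists stand in for tensors. The sketch shows destination-passing style (the caller allocates the output, the callee fills it in) and a name-based lookup of an external function.

```python
# Toy model of the two call forms; not the TVM or Relax API.

PACKED_FUNCS = {}  # hypothetical stand-in for a registry of FFI functions

def register_packed(name):
    """Register a Python function under a global name."""
    def decorator(func):
        PACKED_FUNCS[name] = func
        return func
    return decorator

def call_packed(name, *args):
    """Look up an externally registered function by name and call it."""
    return PACKED_FUNCS[name](*args)

def call_dps(shape, func, args):
    """Destination-passing style: the caller allocates the output and the
    callee writes into it, so the call itself looks like a pure operator."""
    rows, cols = shape
    out = [[0.0] * cols for _ in range(rows)]
    func(*args, out)
    return out

def tir_mm(x, w, y):
    """Low-level matmul written as explicit loops over an output buffer."""
    for i in range(len(x)):
        for j in range(len(w[0])):
            acc = 0.0
            for k in range(len(w)):
                acc += x[i][k] * w[k][j]
            y[i][j] = acc

@register_packed("custom_inplace_update")
def custom_inplace_update(t):
    # Hypothetical handwritten operator: doubles every element in place.
    for row in t:
        for j in range(len(row)):
            row[j] *= 2.0

x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix
gv0 = call_dps((2, 2), tir_mm, [x, w])
call_packed("custom_inplace_update", gv0)
```

Because call_dps returns a fresh value, a graph-level pass can treat the loop-level function like any other operator, while the function body stays available for loop-level rewriting.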
Integrating multiple layers of abstraction brings more design considerations, but once the barriers are gone it also brings many advantages. For example, suppose we have an idea for a quick operator optimization. In a traditional compilation flow, the usual practice is to introduce new rewrite rules at every level. Under the unified abstraction, we can instead introduce a pass that rewrites one local operator into a call_packed call to a handwritten external operator. Once we have confirmed that the performance is reasonable, we can consider how to turn that local rewrite into a more automatic solution. In the same way, we can integrate handwritten operators and automatically generated code more organically. Finally, because calls to TensorIR programs are represented directly in the program, a rewrite of a TensorIR function can be turned directly into a rewrite of the corresponding call in the computational graph, so that information from tensor-level transformations is fed back to graph-level transformations.
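The "rewrite one local operator into an external call" idea can be sketched as a small pass over a toy graph. The graph encoding, the function rewrite_to_external, and the kernel name my_fast_softmax are all hypothetical and chosen only for illustration; a real pass would operate on the compiler's own IR.

```python
# Toy pass with hypothetical names only: the graph is a list of
# (op_name, argument_names) nodes, and the pass redirects one operator
# to a handwritten external kernel while leaving everything else alone.

def rewrite_to_external(graph, target_op, packed_name):
    rewritten = []
    for op, args in graph:
        if op == target_op:
            # Replace the operator with a name-based external call.
            rewritten.append(("call_packed", (packed_name, *args)))
        else:
            rewritten.append((op, args))
    return rewritten

graph = [("matmul", ("x", "w")), ("softmax", ("lv0",)), ("relu", ("lv1",))]
new_graph = rewrite_to_external(graph, "softmax", "my_fast_softmax")
```

Only the targeted node changes, so the experiment stays local: no new rewrite rules are needed at other levels to try out the handwritten kernel.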
Lastly, note that the example supports dynamic shapes through symbolic shapes. The n and m in (n, m) and (n * m) run through the whole program, so the compiler can use this extra information to do more dynamic-shape optimizations. The symbolic expression support is also unified with the symbolic shapes at the TensorIR level. This feature again reflects the importance of co-designing the abstractions.
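A minimal sketch of why symbolic shapes help, assuming a deliberately tiny symbolic type: the Sym class and flatten_shape below are hypothetical and far simpler than a real compiler's symbolic expressions. The point is that the relation between (n, m) and (n * m) can be checked at compile time without knowing concrete values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sym:
    """A symbolic dimension, or a product of symbolic dimensions."""
    factors: tuple  # sorted names, e.g. ("m", "n") stands for n * m

    def __mul__(self, other):
        # Sorting makes the product independent of multiplication order.
        return Sym(tuple(sorted(self.factors + other.factors)))

    def __str__(self):
        return " * ".join(self.factors)

def flatten_shape(shape):
    """Shape of a tensor after flattening: the product of all dimensions."""
    total = shape[0]
    for dim in shape[1:]:
        total = total * dim
    return (total,)

n, m = Sym(("n",)), Sym(("m",))
mm_shape = (n, m)                    # output shape of the matmul
flat_shape = flatten_shape(mm_shape)
same_size = flat_shape[0] == m * n   # provable without concrete values
```

With an opaque "unknown" dimension instead of named symbols, the compiler could not conclude that the flattened length equals the element count of the matmul output.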
Interact: Interactive open iteration
Beyond integrating the abstractions, another very important topic is how to let different people collaborate. Deep learning compilation and optimization involves many kinds of engineers. Algorithm experts want to develop new models and custom operators, machine learning systems engineers need to develop system optimizations, and hardware vendors need to iterate on their own hardware. Each kind of person works at a different level of abstraction. The prosperity of deep learning is largely due to the open ecosystem of deep learning frameworks: anyone can write a model in Python and integrate it with modules written by others. The traditional compiler field is often more closed: a framework is built on multi-stage lowering and then offered to users through a command-line interface.
A beautiful vision is that we perfect each level of abstraction and then assemble the parts like a rocket. But that wish is often unrealistic; in practice every layer still has its own problems. What we need to consider is how to make the layers complement one another through systems engineering, and how to let different people collaborate and quickly iterate toward new solutions.
Letting different people collaborate, with interactive development and iteration, is therefore a topic we must consider first. At this level we follow these principles. Python-first: through TVMScript and a direct multi-language integration architecture, everyone can cooperate through Python APIs. Open development: integration and communication are possible at any level of abstraction. Collaborative iteration: we work together to reach better results. For example, a machine learning expert can write a custom operator with a compute expression; that operator can be optimized by automatic transformation rules written by a systems expert; and the transformation rules themselves use tensor instruction features provided by the hardware vendor. When every link in the chain can be connected quickly, we can rapidly iterate toward the solution we need.
Automate: Automatic optimization and integration
Automatic optimization has always been in the blood of deep learning compilers, and it is also a key link in realizing Unity.
Many traditional automatic optimization approaches are exclusive: they only work when the whole optimization is expressed in the corresponding model (such as a polyhedral model). Once we try to introduce domain knowledge, or the desired optimization falls outside the original scope, automation is hard to use. We need to change this view and focus on how to integrate domain experts' knowledge effectively into the automatic optimization process, so that the strengths of automation and of experts truly combine. The community's MetaSchedule, built on TensorIR, is an important step in this direction.
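One way to picture expert knowledge combined with automatic search is a search loop that accepts expert-written rules as filters on its candidate space. The sketch below is a toy under stated assumptions: it is not the MetaSchedule API, the cost function is an invented stand-in for a measured or learned cost model, and the "multiples of 16" rule is a hypothetical hardware constraint.

```python
# Toy search with hypothetical names: an expert contributes a rule that
# prunes the space of tile sizes, and an automatic search picks the best
# survivor under a stand-in cost model.

def candidate_tiles(extent):
    """All tile sizes that divide the loop extent evenly."""
    return [t for t in range(1, extent + 1) if extent % t == 0]

def expert_rule(tile):
    """Hypothetical domain knowledge: the tensor instruction works on
    tiles whose size is a multiple of 16."""
    return tile % 16 == 0

def cost(tile):
    """Stand-in cost model; a real system would measure or learn this."""
    return 64 / tile + tile

def search(extent, rules):
    """Pick the cheapest candidate that satisfies every expert rule."""
    space = [t for t in candidate_tiles(extent) if all(r(t) for r in rules)]
    return min(space, key=cost)

best_with_expert = search(64, [expert_rule])
best_without = search(64, [])
```

The search loop is unchanged whether or not rules are supplied, so expert input narrows or redirects the automation instead of replacing it.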
Overall ecological integration
In the previous sections we introduced the three key technical points of TVM Unity. Of course, the Unity design does not aim to solve every problem by itself. We know well that the circle reaches maximum efficiency only when it is integrated with the whole machine learning and hardware ecosystem. Integrating the abstractions makes it easier to offer more ways to integrate: we can integrate TVM into existing frameworks, and we can integrate common hardware backends into TVM. Unified abstraction lets such integration happen at every level in a relatively uniform way.
Flexible interfaces such as the TVM FFI also let us connect flexibly with the whole ecosystem, and automation lets us integrate different backend ecosystems more effectively.
Summary and future outlook
This article summarizes our thinking about the field of deep learning compilation over the past two years and our outlook for its future. Advancing the new-generation architecture has always been our core concern, and the features mentioned here have either been refactored already or are in progress. The TVM FFI has matured over the past year, TensorIR has just been merged into the main branch, and MetaSchedule will follow. Relax is also being developed openly in the community. As the community's core theme, TVM Unity will be the focus of the coming year, and some elements already provide many benefits while still in development. As the community moves further along this roadmap, the advantages of integration will become more and more obvious.
Over the past year I have often seen discussions about deep learning compilation infrastructure. Fundamentally, infrastructure is certainly an important part (as Scala is to Spark), but whether it is MLIR, TVM, or another compilation framework, the infrastructure itself is maturing as the projects learn from one another. The real bottleneck lies in abstraction design and in genuine collaborative integration. Of course, the choice of the three key technical points does shape our thinking about infrastructure. For example, the TVM FFI is an important piece of infrastructure that supports the interactive, Python-first characteristics.
Whatever the infrastructure, the real problem we face is how to remove these key barriers. Because of their inherent limitations, existing multi-stage lowering solutions must make a breakthrough to evolve into a new generation of deep learning compilation system. This article discusses the direction of that evolution and a concrete technical roadmap. We hope it inspires the whole field.
On the road toward the new-generation architecture, we are no longer working alone. Many key components were designed jointly by many contributors. We welcome more people to join the community and help build the new generation of deep learning compilation system together.
Link to the original text ：https://zhuanlan.zhihu.com/p/446935289?utm_source=wechat_session&utm_medium=social&s_r=0
Author: Heart of machine. Please include the original link when reprinting, thank you.