
"We're not really a search engine"

2022-05-08 02:23:43 · Summer Arbor

"We really don't make search engines."

Huxiu's note: Before the speech by Peak Labs founder Ji Yichao, the Huxiu F&M Innovation Festival announced its 2019 annual creativity list. A long-standing tradition of the F&M Innovation Festival, the list reflects Huxiu's commitment to continually discovering, selecting, and reporting on the products, content, and people that make our lives different and refresh our senses, and to paying tribute to the innovators, and the spirit of innovation, of this moment in time.

Let's first look at which products, works, and people won Huxiu's 2019 "Brain Hole of the Year" awards. Then, back to the main topic: Peak Labs founder Ji Yichao, whose distinctive "knowledge extraction engine" made the creativity list, shares the product innovation behind it, the experience of bringing it to production, and his views on the industry.

Huxiu 2019 Brain Hole of the Year awards: the "Annual Creativity List"

2019 Creative software of the year: Hongmeng OS (HarmonyOS) and Magi

Creative digital product of the year: vivo NEX 3 5G edition

Creative animated work of the year: Ne Zha: Birth of the Demon Child

Creative variety show of the year: the China Media Group 2019 Host Competition

Creative documentary of the year: Two Hundred Years of Surgery

Creative exhibition of the year: "Picasso: Birth of a Genius" at the Ullens Center for Contemporary Art

Creative architecture and space of the year: the courtyard (siheyuan) kindergarten by MAD Architects

Creative cross-border innovation of the year: Lei Jiayin vs. Handmade Geng; Borgward's live-streamed car sales from its auto factory

Brain Hole of the Year: Bilibili tech uploader "Hello teacher, my name is He Tongxue"

The following is the speech by Peak Labs founder Ji Yichao (edited for length):

Hello everyone! I'm honored to win Huxiu's Brain Hole Award today, although once again we've been written up as a search engine.

So let me say it here again: Magi is not really a search engine. We've received a lot of attention recently, which was honestly quite unexpected; we are a team that takes knowledge engineering very seriously.

Perhaps because people always have some small complaints about domestic search engines, or high expectations of them, we were pushed into the spotlight. Of course, I think that's actually a good thing. So I'd like to take this opportunity to share what we are doing, or rather, our views on the current development of AI (artificial intelligence).

The topic of my speech today is large-scale knowledge engineering. It's called "The Shared Brain of Humankind: the AI behind the AI". That sounds grand, so let's start with the small things.

Why natural language processing hasn't developed as well as hoped

Let's first look at the fields of AI that are developing well right now: speech, computer vision (CV), robotics. What do these fields have in common, and why should we explore that?

Actually, many people, myself included, feel that natural language processing (NLP) always seems to lag behind them, and that's a fact. Domestic NLP companies are having a hard time, half-dead, ourselves included. So we wanted to see what those other fields have in common.

This is not just my personal view. Many people think that speech, images, and robotics are, to some extent, matters of perception. What do I mean by perception? Many capabilities at the perceptual level are things animals may have as well.

Of course, these fields are not shallow. Rather, the reason they do so well is that they have better, more natural representations. For example, speech can be described clearly with frequency bands, images have pixels, and robots have signal-based control.

But natural language understanding, or natural language processing, as an industry actually started earlier than these fields. Yet until the explosion of deep learning, it didn't seem to see particularly good development.

Many people in industry and academia, myself included, generally believe natural language processing is a bit behind these fields; by "behind" I mean both real-world deployment and some of the cutting-edge technology. Its current stage of development is perhaps comparable to where those fields were around 2015.

People talk about the "four little dragons of AI", but that really means the four little dragons of CV: SenseTime, Yitu, and similar companies. Which makes us wonder: why does our field seem so miserable? Is it because the practitioners in this field are stupid or lazy? Definitely not; that alone couldn't have such a large impact.

Let's think about this problem carefully .

First of all, what is the deal with language? There is a book by a cognitive computational scientist at the University of Edinburgh which points out that humans have many close relatives: monkeys, orangutans, and other animals. Like humans, they are all mammals; yet only human beings have mastered such a complex form of language. This is a very strange thing.

Why do I use animals as an example? Animals can also see things and send signals. For example, killer whales can emit some 60 different signals to communicate with their pods. Or take the mantis shrimp that was popular a while ago: its eyes have 12 different kinds of photoreceptors, while human eyes have only three, so the world in its eyes must be colorful in ways we can't even imagine.

Monkeys can even grind down branches to "make" weapons to hit others with, much smarter than the robot vacuum at home that only knows how to "chew" slippers. So why has language passed over all of our close relatives? Scientists began to think about this through the lens of evolution, to explore how language came into being.

The last part of the book says that only humans began to live in sufficiently large communities, and then gradually formed a form of communication based on ostension and inference.

This is actually a very unnatural thing. For example, suppose I were a monkey, with another monkey next door. One day I see a bear, point at it, and say "whoosh whoosh whoosh". Over time, our group starts calling bears "whoosh whoosh whoosh". Things like this are very unsuitable for computers to deal with.

Because language, or natural language, is one "convention" layered on top of another "convention". Rolling on like this for thousands of years, it forms the language we speak today.

Let me give you a concrete example that I heard from someone a while ago and think is very good. We often say: no one can beat Chinese table tennis. We all know Chinese table tennis is very good; that is, no one can beat us.

But if we swap out the words "table tennis" and say that no one can beat Chinese football, everyone immediately understands this as mockery of the national football team. To a computer, though, with no common sense, no so-called world view as context, the only difference between these two sentences is the word "football" versus "table tennis".

This is because when you understand a passage, it is no longer simply perceived from an input. We have brains, we have our own lives, so we can understand what these contents refer to. More broadly, this includes more complicated, longer, context-dependent understanding, such as resolving what pronouns refer to.

The things we've been talking about are collectively referred to as common sense. Today our voice assistants can often help us complete transactional operations; for example, you can tell a Tmall Genie to turn on the lights, turn off the lights, do this or that. But as soon as we ask for a little more than expected, the voice assistant may be stumped.

Or look back at search engines 20 years ago. Why is this? In many cases, computers lack common sense, and common sense is a very difficult thing to build. We often think people can communicate effectively because we share a consensus; but where would a machine's consensus come from? It must come from data. So data is very important for NLP, and yet here lies another rather tragic story.

As I just said, computer vision is developing very well, and for that we should give credit to Dr. Fei-Fei Li's work on ImageNet. Of course there are many other large-scale annotated datasets; together they have given computer vision an unprecedented opportunity, because they provide a great deal of labeled data.

Any downstream task can be pre-trained upstream and thereby get a good boost. This is a typical crowdsourced labeling process. For example, take an image of a cat: ask three children, even ones who haven't gone to school yet, and every one of them knows it's a cat.

For speech it's the same: we have different people pronounce the same word to collect the data, and that's no problem either, even with an accent. Of course, if your accent is at the level of a dialect, that's another matter; that's why people train separate models for Mandarin and Cantonese.

But once we reach the field of knowledge, the problem becomes more uncomfortable. Consider three near-identical sentences: "In Beijing there is a university with a professor named Ji Hang"; "There is a professor at Peking University called Ji Hang"; and "There is a university professor named Ji Hang in Beijing."

This is not a word-segmentation problem; segmentation is actually shallow natural language processing. This is a problem of the ambiguity of knowledge. I know what these sentences mean, because Ji Hang is my father; but a machine without common sense has no way to resolve them.
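To make the ambiguity concrete, here is a toy sketch (the triple schema, relation names, and readings are invented for illustration, not Magi's internals): the three paraphrases share almost the same keywords on the surface, yet each should map to a different structured reading.

```python
# Toy illustration: three near-identical sentences, three different
# structured readings. The relation names and triples are invented.

readings = {
    "There is a professor at Peking University called Ji Hang":
        ("Ji Hang", "professor_at", "Peking University"),
    "In Beijing there is a university with a professor named Ji Hang":
        ("Ji Hang", "professor_at", "an unnamed university in Beijing"),
    "There is a university professor named Ji Hang in Beijing":
        ("Ji Hang", "lives_in", "Beijing"),  # employer left unspecified
}

# A surface-level comparison (shared keywords) cannot separate them:
def keywords(sentence):
    stop = {"there", "is", "a", "an", "at", "in", "with", "named", "called"}
    return {w.strip(".").lower() for w in sentence.split()} - stop

shared = set.intersection(*map(keywords, readings))
# All three share nearly the same keywords, yet map to distinct triples.
```

This is exactly why crowdsourced labels for such sentences come back inconsistent: the differences that matter are not visible at the word level.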

For crowdsourcing, this is also a very big problem. We often say of artificial intelligence: however much human labor you put in, that's how much "intelligence" you get out. The large-scale shortage of annotated samples in natural language understanding is, for the most part, because it is hard to get many people to produce a consistently standardized annotation.

Take the question just now: if you crowdsource labels for that data, there will be massive internal contradictions, and in the end the model either fails to converge or simply misses everything, and nothing can be done.

Putting together the two points just mentioned:

  • Today's AI lacks common sense

  • Natural language processing lacks large-scale annotated data

So many people believe the knowledge graph should be one of the saviors here. Actually, knowledge graphs have existed for a long time, but they have always had one big problem: where does the knowledge graph come from?

Companies like Google, and others at home and abroad, have knowledge graph applications. But think about where those knowledge graphs come from; it's a very interesting question. To put it bluntly, most of today's so-called knowledge graphs are actually the union of Hudong Baike, Baidu Baike, and Wikipedia, combined with existing structured data.

For example, data from the World Bank and similar sources: these are equivalent to taking data that humans have already sorted out and connecting it better. Of course, on that basis you can use AI to make deeper connections that fill in gaps, or combine more diverse, multimodal data. But none of this solves a very practical problem: most of the information in the world actually lives in free text.

What is free text? The body of a web page, for example. By contrast, many reports inside our enterprises exist as Excel spreadsheets, which is already a structured or semi-structured form that programs can easily understand.

Now imagine you have a Word document. It may be a résumé, or an internal report. These exist in the form of text, and there has been no solution for extracting knowledge from such free text and structuring it.

Imagine if we could do such a thing: we could significantly raise the utilization and scale of information to a higher level. Magi is not a search engine. What it tries to solve is the problem of where knowledge graphs come from, that is, large-scale knowledge engineering: the technology of building a knowledge graph that is as trustworthy as possible from very unreliable text.

Gartner's technology hype curve shows that analysts believe knowledge graphs may need another 5–10 years to truly mature and move into production environments.

I believe their survey was written before Magi's exposure, and their main point is not wrong. Our company has been doing this since 2015; we've been at it for 5 years, so that timeline is normal.

A while ago, you may have seen in the media that a search engine called Magi suddenly became popular for no apparent reason. That really was unexpected, and we couldn't be bothered to explain.

For those who haven't seen Magi yet, take a look: these are the search results of our public version. You search for entities, or ask natural-language questions directly, and based on data from the whole web, Magi automatically organizes an enormous knowledge graph. It is not based on any existing knowledge base, which is fundamentally different from the knowledge graph applications I just described.

For example, in this interface, what we are looking at is knowledge in the medical field, learned not from an authoritative database but from free text across the whole web with no whitelist involved. The interface looks quite fancy: there is red, green, and yellow.

Actually, each color represents the credibility learned by the model. This credibility is not the probability that the fact is true; it is the model's estimate of the probability that it has learned the fact correctly. Therefore we must never take whatever comes out of an AI black box and present it to ordinary users directly as correct knowledge. That's why we ourselves attach such importance to traceability.

For every piece of information the AI gives, it must cite its sources. Not like a question-answering robot where you ask a question and are simply expected to believe whatever it says; that is not acceptable.

How are these colors computed? I have answered the technical details in depth elsewhere, but let me give a simple, intuitive example. Suppose we have a paper. How do we say whether this paper is good, whether it has great influence? One of the simplest indicators is how many other papers cite it.

Based on this assumption, Google very brilliantly launched PageRank. PageRank says, in effect: if more and more pages link back to my web page, then my popularity and importance should be higher.
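The idea can be sketched with a few lines of power iteration; the tiny three-page "web" below is invented purely for illustration:

```python
# Minimal PageRank sketch (power iteration): pages that receive more
# inbound links from important pages score higher.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

web = {
    "a": ["c"],  # both a and b link to c,
    "b": ["c"],  # so c should rank highest
    "c": ["a"],
}
scores = pagerank(web)
```

The damping factor (0.85 here, the value commonly quoted for PageRank) models a surfer who occasionally jumps to a random page, which keeps the iteration well behaved.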

All these years later, we can make an analogous assumption about knowledge: if a piece of knowledge is mentioned, in different forms, across more distinct high-quality data sources, it should have higher credibility.

There are two points here: first, more distinct high-quality sources; second, each source's expression should be different, appearing in different contexts. Because the internet corpus is actually very dirty: many authors' articles get scraped, plagiarized, and spliced by others.

If we learn the same piece of knowledge from two articles whose wording is very different, it shows the text has at least been re-created, passed through human hands. In effect, we constructed a semi-automatic mechanism to mine all this knowledge from plain text. That is what the public version shows you. It's not really a search engine.
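A hedged sketch of that credibility heuristic (this is not Magi's actual algorithm; thresholds, weights, and the word-overlap measure are invented for illustration): each high-quality source that states a fact adds to its score, but near-duplicate wordings, which are likely copies, are discounted.

```python
# Sketch: a fact gains credibility from each sufficiently *novel* wording;
# near-duplicates (likely scraped or plagiarized copies) contribute nothing.

def jaccard(a, b):
    """Word-overlap similarity between two sentences, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def credibility(statements, dup_threshold=0.8):
    """statements: list of (sentence, source_quality) supporting one fact."""
    kept, score = [], 0.0
    for sentence, quality in statements:
        if any(jaccard(sentence, k) >= dup_threshold for k in kept):
            continue  # too close to a wording we already counted
        kept.append(sentence)
        score += quality
    return score

independent = [
    ("the phone costs 19999 yuan", 0.9),
    ("its launch price was set at 19999 yuan yesterday", 0.8),
]
copies = [
    ("the phone costs 19999 yuan", 0.9),
    ("the phone costs 19999 yuan today", 0.9),  # near-duplicate wording
]
```

Two independently phrased reports end up more credible than two copies of the same sentence, matching the intuition in the talk.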

How do you try it out? Just open it in your browser; add it to the home screen and it becomes an app. It has no background processes and no push notifications; very clean. If you're interested, give it a try. Of course, our small company's servers are not very powerful and often time out.

This public version serves two purposes. On the one hand, it lets many enterprises see NLP actually deployed; we can give them a very similar demonstration. Why? Because compared with the internet, the world's "dirtiest" corpus, the data inside an enterprise is relatively clean.

We like to give a very crude example: if you can swim the butterfly stroke in a septic tank, doing it in everyone's own swimming pool is absolutely easy.

On the other hand, it shows that the product is actually a self-supervised loop. What is the purpose? To build a very large-scale annotated dataset, and at the same time to try to address today's AI's lack of common-sense and real-time knowledge. Both of those gaps are deadly, and timeliness matters a great deal as well.

Take the Xiaomi MIX Alpha phone released earlier as an example. Xiaomi kept the MIX Alpha very well under wraps: no one disclosed the price before the launch event. By the end of the event, you can see that Magi had already learned the MIX Alpha's price, 19,999 yuan, but at that point its confidence was very low, only 11.

A little while later, maybe ten minutes or so, we had learned the 19,999-yuan price from more text, from more different reports, and the confidence began to rise. In the lower-right corner you can see Baidu Baike: that kind of thing depends on users' UGC editing, and users can never keep up with the speed at which AI ingests information.

So you can see that at 16:56, no one had yet edited this price into Baidu Baike, but Magi already had a fairly credible result for the MIX Alpha's price.

A few hours later, more information about the Mi MIX Alpha entity had gradually converged from different places around the internet, and its credibility gradually improved.

What are we doing ?

Let's take stock of what the Magi project is actually doing:

First, what we are solving is the technology of automatically building a credible knowledge graph from plain text. That may sound a little convoluted; a while ago it did receive some recognition from academia, because it is a very radical attempt.

Second, as just mentioned, building a large-scale annotated dataset; we really want to build the benchmark for the knowledge domain.

Third, lifelong learning and continuous optimization through the internet. The Magi we just mentioned is the familiar web-page form; the real Magi is the learning technology underneath.

By deploying the Magi technology on top of that and continuously learning from text on the internet, more and more external statistics are introduced: the cross-validation just mentioned, as well as factual contradictions. The system automatically reinforces good results and weeds out wrong ones.

You can see that a bad result found today may already have been judged and filtered out by the AI when you refresh tomorrow. Through this continuous optimization, we get a better base model, which in turn lets us complete the second item, building annotated datasets. This is a self-supervised process.
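The loop can be sketched schematically (all names here are invented for illustration, not Magi's internals): extract candidate facts from pages, keep only those confirmed by enough independent pages as pseudo-labels, and feed those back as training data.

```python
# Schematic self-supervised round: extract, cross-validate, keep the
# confident facts. In the real system the trusted set would retrain the
# extractor; here we simply return it.

def self_supervised_round(model, pages, threshold=2):
    # 1. Extract candidate facts from every page. The "model" here is
    #    just a callable standing in for a learned extractor.
    counts = {}
    for page in pages:
        for fact in model(page):
            counts[fact] = counts.get(fact, 0) + 1
    # 2. Keep facts confirmed by at least `threshold` pages as pseudo-labels;
    #    one-off extractions are treated as unverified and dropped.
    return {fact for fact, c in counts.items() if c >= threshold}

# Toy extractor: pretend each page is a ";"-separated list of facts.
toy_model = lambda page: page.split(";")
pages = ["price=19999;color=blue", "price=19999", "weight=200g"]
trusted = self_supervised_round(toy_model, pages)
```

Facts seen on only one page stay out of the training set until more evidence arrives, which mirrors the rising-confidence behavior shown in the MIX Alpha example.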

Fourth, solving the problem of acquiring common-sense knowledge, and structuring it. We have finally connected this model to the world, so that it can constantly update itself and master the latest knowledge.

Fifth, multi-task transfer and cross-domain learning, which I'll come back to later.

Sixth, a very ambitious vision. We often talk about explainable AI, and we have now laid the groundwork for it; at the very least, knowledge should not be impossible to trace to its source. So for every result Magi gives, you can see exactly which passage it was learned from. Nothing comes out of nowhere.

After all, we are a commercial company and still have to make a living. So let me also explain why we say we are not a search engine: there is no commercialization, there are no advertisements, and it especially burns graphics cards; every time someone uses it, we lose money.

So how do we survive? We provide Magi's automatic learning technology as a service. For example, a headhunting company came to us: "We have a lot of résumés and want to build a structured talent pool. Can Magi help us automatically read these résumés, extract knowledge about the people, and organize it?"

In that case, you can even end up searching by saying directly: software engineers in Beijing with more than 3 years of development experience in a given language and a graduate degree, sorted by age in ascending order. That can be done.
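Assuming the résumés have already been extracted into structured records, that query becomes a plain filter-and-sort; the field names and sample records below are invented for this sketch.

```python
# Sketch of the structured talent-pool query above, over hypothetical
# records extracted from résumés (field names are invented).

resumes = [
    {"name": "A", "city": "Beijing", "years_experience": 5,
     "degree": "master", "skills": ["go"], "age": 29},
    {"name": "B", "city": "Beijing", "years_experience": 2,
     "degree": "master", "skills": ["go"], "age": 24},
    {"name": "C", "city": "Shanghai", "years_experience": 6,
     "degree": "phd", "skills": ["go"], "age": 33},
    {"name": "D", "city": "Beijing", "years_experience": 4,
     "degree": "bachelor", "skills": ["go"], "age": 31},
    {"name": "E", "city": "Beijing", "years_experience": 7,
     "degree": "phd", "skills": ["go"], "age": 35},
]

def search(records):
    hits = [r for r in records
            if r["city"] == "Beijing"
            and r["years_experience"] >= 3
            and r["degree"] in ("master", "phd")   # graduate degree
            and "go" in r["skills"]]               # the given language
    return sorted(hits, key=lambda r: r["age"])    # ascending age

result = search(resumes)
```

The hard part, of course, is everything before this snippet: turning free-text résumés into records like these is exactly the extraction problem Magi targets.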

Previously, similar functionality could be achieved with technology from companies specializing deep in a vertical domain, but there is a very big problem: to reach that effect in a vertical, you need those customers to provide their unique data, structured data to be exact.

And that process is not smooth. For example, we all know doctors are well paid, and they have their own mission: healing the sick and saving lives. I can't say, "You doctors, don't see patients for two months, come label data for us, prepare 10,000 examples to make sure the model works." That is unlikely.

Whereas, as I just said, through continuous self-learning Magi has accumulated more and more common sense; you could say it has learned how to read. As for what to read out, our customers no longer have to give us more than 10,000 examples. A few hundred or a few thousand, as a guide, is enough, and the whole process can be completed through a very standard graphical interface.

In addition, because we started in 2015, very early: since Google's model came out last year, people have paid more attention to cross-lingual transfer and zero-shot cross-language transfer. We started on this very early, and we support 170 languages. The training is based on a Chinese corpus, so the results are not necessarily great for every language.

Two examples: one is Japanese, which is very close to us; its subject-predicate-object order differs from Chinese, but it shares many characters. The other is Thai, and even right-to-left languages such as Arabic; these are not a big problem either.

Time is limited, so I won't say too much about the technology. What I want to say is that this is really very difficult, very painful; it took us 4 years to reach today's level. We built our own "Google": although Magi provides web search, web search is not our main business. It serves the whole model, as support for trusting the input. We don't use Baidu's or Google's infrastructure; we first built a search engine from scratch.

Second, we prepared the data "snowball" ourselves, built up from zero; it is unique to us, not an open dataset that academia can find. And since we were exposed ahead of schedule, a lot of things are not ready yet. When you use our service you may run into all kinds of accidents; please be tolerant.

Source: Huxiu

