Patent In-Depth

Natural language processing (NLP) refers to the computer processing of the form, sound, and meaning of natural language, that is, the input, output, recognition, analysis, understanding, and generation of characters, words, sentences, and discourse. Realizing information exchange between humans and machines is an important issue of common concern in artificial intelligence, computer science, and linguistics.

In recent years, technology giants and startups alike have poured resources into commercialization. Yet outside of speech and machine translation, natural language processing has made little headway in many areas. For example, sentence recognition mainly involves identifying the verbs, nouns, and adjectives in a sentence, a very simple and basic task; yet from 2009 to 2017 its accuracy improved by less than 1%, and it currently stands at only 57%. Although natural language processing has become a hot sub-field of artificial intelligence, the technology itself still has ample room for growth and remains at an early stage.

Against this backdrop, the Research Group on Key Artificial Intelligence Technologies, part of China National Intellectual Property Administration's patent analysis popularization project, conducted an in-depth analysis of the natural language processing industry, tracing the development of patented technology and important applicants across both specialized and general-purpose technologies, for the industry's reference.

Deep learning promotes the rapid growth of patent applications for natural language processing.

Trend of patent application for natural language processing and its technological evolution path

Patent filings in natural language processing began around 1970; through 1985, annual applications numbered no more than 30.

After 1985, with the development of network and computer technology, rich corpora became a reality and hardware was continually updated and improved. The trend in natural language processing shifted from rationalism to empiricism, and statistics-based methods gradually replaced rule-based methods. The number of applications began to rise rapidly, and by 2000 annual filings had reached 780. Jelinek and his lab at IBM's Watson Research Center were key drivers of this change: using statistics-based methods, they raised speech recognition accuracy from 70% to 90%. At this stage, natural language processing based on mathematical models and statistics made substantial breakthroughs, moving from the laboratory into practical application.

From 2008 to the present, inspired by achievements in image recognition and speech recognition, researchers gradually began to apply deep learning to natural language processing. From early word vectors to word2vec in 2013, the combination of deep learning and natural language processing reached a climax and achieved notable success in machine translation, question answering, reading comprehension, and other areas. Annual application volume grew from 1,258 in 2008. Deep learning uses a multi-layer neural network that starts from the input layer and produces output through layer-by-layer nonlinear transformations, training end to end from input to output: prepare input-output pairs, design and train a neural network, and then perform the expected task. Recurrent neural networks (RNNs) have long been among the most commonly used methods in natural language processing, and variants such as GRU and LSTM have triggered wave after wave of enthusiasm. For these reasons, patent applications related to natural language processing have seen a new round of growth since 2009.
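The recurrent, layer-by-layer nonlinear computation described above can be illustrated with a minimal sketch. This is not any particular patented system, just a bare Elman-style RNN step in plain Python with invented toy weights; real systems learn these weights end to end from input-output pairs.

```python
import math

def rnn_step(x, h, W_xh, W_hh, b_h):
    """One Elman-RNN step: h' = tanh(W_xh @ x + W_hh @ h + b_h)."""
    size = len(h)
    return [
        math.tanh(
            sum(W_xh[i][j] * x[j] for j in range(len(x)))
            + sum(W_hh[i][j] * h[j] for j in range(size))
            + b_h[i]
        )
        for i in range(size)
    ]

def rnn_forward(xs, W_xh, W_hh, b_h, h0):
    """Run the recurrence over an input sequence, returning all hidden states."""
    h, states = h0, []
    for x in xs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
        states.append(h)
    return states
```

Each hidden state depends on the current input and the previous hidden state, which is what lets an RNN carry context along a sentence; GRU and LSTM refine this same recurrence with gating.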

China and the United States are the most competitive countries in this field

From the perspective of source countries, China and the United States have the largest number of patents in this field, and are the main technology reserve countries and source countries.

Trends of patent applications in China and the United States

In this field, patent applications in both China and the United States have shown steady growth, indicating that both countries attach considerable importance to natural language processing research and development and to patent reserves. Although there was a gap between China and the United States in the early years, after long-term accumulation China surpassed the United States in 2012 to become the country with the most annual patent applications in the world, at 526 that year, and the gap has since widened further: in 2016, China's filings reached 1,668, nearly twice the United States' 856. China has become the country with the highest annual application volume in the world and is likely to become the country with the largest patent reserves within the next few years.

Baidu entered the top ten in the world.

The patent reserves of Chinese innovators still need to be strengthened.

Global ranking of major applicants' patent applications

In the global ranking of major applicants, IBM holds a large lead in application volume and occupies the first camp by itself. Microsoft's application count is nearly 400 fewer than IBM's, roughly four-fifths of IBM's total, placing it in the second camp. The applicants from third-ranked NTT down to tenth-ranked Foxconn are separated by fewer than 300 applications and form the third camp. Among Chinese applicants, Baidu ranks eighth with 457 applications, and Foxconn, based in China's Taiwan region, has also entered the global top ten. The patent reserves of Chinese innovators still need to be strengthened.

The acceleration of technical iteration has promoted the rapid development of natural language processing technology.

(1) Deep learning promotes the rapid development of part-of-speech tagging technology.

Part-of-speech tagging assigns a part-of-speech tag to every word in a natural language text. Correct tagging is a basic step in natural language processing; a wrong part-of-speech judgment can lead to misunderstanding the entire sentence.

The development route of part-of-speech tagging technology

Judging from the technology's development path, there were few patent applications for part-of-speech tagging before 1980. Between 1980 and 1990, rule-based part-of-speech tagging, the earliest approach, emerged. Its basic idea is to build a set of tagging rules that is as accurate as possible and then apply that rule set to the corpus to be tagged so as to obtain correct results. The disadvantages of rule-based tagging are that it is overly specific, hard to upgrade further, and difficult to adjust to actual data, so it falls short in practical use.
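The rule-based idea described above can be sketched minimally: a hand-built lexicon gives known tags, and suffix rules guess the rest. The lexicon, suffix rules, and tag names below are invented for illustration; a real rule set would be far larger, which is exactly why the approach is hard to maintain and extend.

```python
# Hypothetical hand-written resources (illustrative only).
LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN", "runs": "VERB"}
SUFFIX_RULES = [("ly", "ADV"), ("ing", "VERB"), ("ed", "VERB"), ("s", "NOUN")]

def tag_word(word):
    w = word.lower()
    if w in LEXICON:                  # rule 1: trust the hand-built lexicon
        return LEXICON[w]
    for suffix, tag in SUFFIX_RULES:  # rule 2: fall back to suffix heuristics
        if w.endswith(suffix):
            return tag
    return "NOUN"                     # rule 3: default guess

def tag_sentence(sentence):
    """Tag each whitespace-separated token with the first matching rule."""
    return [(w, tag_word(w)) for w in sentence.split()]
```

Every unusual word needs a new entry or rule, and rules written for one corpus rarely transfer to another, which is the brittleness the article describes.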

After 1990, statistics-based part-of-speech tagging was developed, with hidden Markov models and conditional random fields applied to the task. All knowledge is acquired automatically by training parameters on a corpus, yielding good consistency and high coverage, so statistics-based tagging came into wide use. However, statistical methods have their own shortcomings and limitations: estimating model parameters requires a large training corpus, and the choice of that corpus affects accuracy.
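A hidden Markov model tagger can be sketched in a few lines with the Viterbi algorithm. The start, transition, and emission probabilities below are invented toy numbers; in a real system they would be estimated from a tagged corpus, which is where the dependence on training data discussed above comes from.

```python
# Toy HMM parameters (illustrative only; normally estimated from a corpus).
STATES = ["NOUN", "VERB"]
START_P = {"NOUN": 0.6, "VERB": 0.4}
TRANS_P = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
EMIT_P = {"NOUN": {"dogs": 0.4, "bark": 0.1},
          "VERB": {"dogs": 0.05, "bark": 0.5}}

def viterbi(words):
    """Most likely tag sequence under the toy HMM (Viterbi algorithm)."""
    # V[t][s]: probability of the best path ending in state s at position t
    V = [{s: START_P[s] * EMIT_P[s].get(words[0], 1e-6) for s in STATES}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in STATES:
            prob, prev = max(
                (V[t - 1][p] * TRANS_P[p][s] * EMIT_P[s].get(words[t], 1e-6), p)
                for p in STATES)
            V[t][s], back[t][s] = prob, prev
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path
```

Unlike the rule-based sketch, nothing here is word-specific beyond the learned probability tables, which is why statistical taggers generalize better but degrade when the training corpus does not match the target text.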

Because neither rule-based nor statistical methods handle every problem satisfactorily, researchers proposed part-of-speech tagging that combines rules and statistics, chiefly by combining dictionaries with statistical models. This combined approach largely compensates for the weaknesses of any single method and plays to the strengths of both; in effect, it is a combination of rationalism and empiricism.

In recent years, artificial intelligence-based methods have also been applied to part-of-speech tagging. Compared with the three earlier approaches, they offer strong adaptability and high accuracy. Chinese applicants have done extensive research in this area; their filings have grown explosively and produced a series of research results.

(2) Unsupervised learning is the main development direction of word-level semantics, and innovators are accelerating their entry with differing layouts.

The goal of semantic analysis is to build effective models and systems that automatically analyze language units at every level (words, sentences, and discourse) so as to understand the true meaning of a whole text. Word-level semantic analysis focuses on how to acquire or distinguish the meanings of words.

The development path of patent technology oriented to word-level semantic analysis

There are many methods of word-level semantic analysis. In terms of development, within dictionary-based semantic analysis, approaches based on dictionary senses, grammatical structure, bilingual dictionaries, and the Yarowsky algorithm no longer produce important new patent applications, and there are few important applications based on examples or statistical models. Thanks to the development of keyword extraction, technologies based on semantic dictionaries were still producing key patents in 2017 and will remain one focus of future development. Meanwhile, unsupervised learning, driven by big data, algorithms, and chip technology, requires no special-purpose corpus and scales well, and will become the main direction of future development.
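The dictionary-based strand mentioned above can be illustrated in the spirit of the classic Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the sentence's context. The tiny gloss dictionary below is invented for illustration and is not drawn from any patent discussed here.

```python
# Hypothetical sense inventory (illustrative only).
GLOSSES = {
    "bank": {
        "finance": "institution that accepts deposits and lends money",
        "river": "sloping land beside a body of water",
    }
}

def lesk(word, context):
    """Return the sense of `word` whose gloss overlaps the context most."""
    ctx = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in GLOSSES[word].items():
        overlap = len(ctx & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

The method's reach is bounded by the dictionary's coverage, which is precisely why unsupervised approaches that learn sense distinctions from raw text without a special corpus are attractive as a future direction.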

Analysis of important applicants in word-level semantics in China

As of August 2018, six applicants in China had filed more than three patent applications each in this area; Qilu University of Technology ranks first, followed by Kunming University of Science and Technology, Baidu, Tencent, Fujitsu, and IBM. Among foreign applicants in China, IBM began filing patent applications for disambiguation based on dual dictionaries in 1999, then filed applications based on context-based abbreviations and on bags of words, respectively, in 2014. Fujitsu filed its first patent application, based on bilingual disambiguation technology, in 2012, followed by applications based on combination probability and on abbreviated words in 2012 and 2016, respectively. Kunming University of Science and Technology submitted a patent application for disambiguation based on an improved Bayesian method in 2008. Tencent's related applications focus on the use of word popularity, text content, and basic word and phrase dictionaries, and include an application related to dictionary construction. Baidu filed its first patent application in 2012; its research covers the construction of multi-granularity dictionaries, the use of user selections, and search based on the resolution of ambiguous words. In 2018, Baidu filed a patent application for word-level semantic analysis based on unsupervised neural networks.

In the early days, Tsinghua University, Peking University, the Institute of Acoustics of the Chinese Academy of Sciences, Harbin Institute of Technology, NEC (China), Google, and other research institutes and enterprises all filed related patents in China. With the development of the technology and growing attention from innovators, Nanjing University of Posts and Telecommunications, East China Normal University, Foxconn, Shanghai Jiao Tong University, and others have also conducted research in related fields. After 2014, Soochow University, Nanjing University, Sun Yat-sen University, and other universities also joined word-level disambiguation research and development.

It is worth noting that, although Chinese applicants have participated in word-level disambiguation research in every period, most of the early strong Chinese applicants, with the exception of Kunming University of Science and Technology, did not continue to file relevant applications. In unsupervised disambiguation, which now leads the development of word-level disambiguation technology, only Baidu has filed relevant patent applications.

(3) Neural networks are the focus of machine translation development: IBM has deep accumulated reserves, and Baidu is accelerating to catch up.

In the 1940s and 1950s, machine translation was at the stage of theoretical research; the invention of the computer and research in information theory laid its theoretical foundations. No relevant patent applications were filed during this period.

Development of machine translation system industry and technology

Beginning in the 1960s, the field entered the era of rule-based machine translation systems, and related patents began to appear sporadically. IBM, as a pioneer in computing, played a very important role in this period and accumulated a large number of basic patents on formalized machine translation systems. Universities and government research institutions were also important contributors: machine translation products such as the Systran system were born in university laboratories and survived and developed through government project cooperation.

From 1980 to 1990, machine translation systems gradually matured and went to market, and patent applications, mainly from enterprises, began to explode. Since the start of the 21st century, however, the advantages of Internet companies in this field have emerged: with huge accumulations of Internet corpora and algorithms, companies such as Google, Microsoft, and Baidu have surpassed established firms such as IBM and Toshiba. Especially amid the technological revolution brought by deep learning in recent years, the relative importance of data resources has declined, and recent revolutionary technologies have all come from innovation in system and algorithm frameworks.

Looking to the future

Although the United States and Japan built up early accumulations in natural language processing, China has accelerated its catch-up in recent years, becoming the country with the most patent applications and the second-largest patent reserves in the world; future competition will be mainly between China and the United States. At the same time, the combination of artificial neural networks and natural language processing has driven rapid development of general technologies such as lexical analysis, syntactic analysis, semantic analysis, language models, and knowledge graphs, and has accelerated the deployment of specialized technologies such as machine translation, automatic summarization, automatic question answering, and sentiment analysis. Increasing research and development in neural-network-based natural language processing will help China and domestic innovators overtake on the curve and seize the commanding heights of artificial intelligence.

Yin Qiliang, Luo Qiang, Ye Sheng | Research Group on Key Artificial Intelligence Technologies, Patent Analysis Popularization Project, China National Intellectual Property Administration