Myra's Technology
World-leading understanding of language driven by deep learning
Overview

Myra has developed a proprietary concept-based search engine that uses deep learning to understand the semantic content of queries and documents, and learns from user behavior to continually improve over time.

Introduction

Most search engines today, including popular consumer-facing and enterprise search engines such as Google and Elasticsearch, are based on keywords - the search engine finds documents containing the keywords in a user's query, using simple statistics like TF-IDF to favor documents that feature those keywords frequently or in close proximity to one another. While keyword search can return helpful results, the process does not consider the context of the user's question. Consequently, the optimal result may never be returned, since the search engine is simply attempting to match keywords from the search query with search results.
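
To make that limitation concrete, here is a minimal TF-IDF scorer in Python. The corpus and query are hypothetical, and real engines use more refined variants, so treat this as a sketch of the general idea:

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, documents):
    """Score each document against a query by summing TF-IDF weights.

    This is a bare-bones illustration of keyword search: it matches
    terms literally and ignores meaning, context, and word order.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in documents:
        for term in set(doc.lower().split()):
            df[term] += 1

    scores = []
    for doc in documents:
        terms = doc.lower().split()
        tf = Counter(terms)
        score = 0.0
        for term in query_terms:
            if term in tf:
                idf = math.log(n_docs / df[term])
                score += (tf[term] / len(terms)) * idf
        scores.append(score)
    return scores

docs = [
    "the quick brown fox jumps over the lazy dog",
    "a fast auburn fox leapt over a sleepy hound",
]
print(tf_idf_scores(["quick", "fox"], docs))
# The second document describes the same scene in different words,
# so plain TF-IDF scores it far lower - exactly the gap that
# concept-based search is meant to close.
```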

Current academic research is focused on question/answer (Q/A) systems, where a user asks a question in natural language and the system returns the answer from a set of documents or content it has ingested. However, these Q/A models require supervised learning on structured data sets. In production, the structured data necessary to properly train these systems either doesn't exist or requires significant and expensive manual effort to obtain.

Myra's concept-based search engine uses deep learning to go beyond keyword-based search and enables search based on the semantic meanings of documents. Moreover, the engine continually improves over time as it learns from users' interactions with it. Our approach blends a natural language processing (NLP) component - which extracts the meanings of the user search queries - and an information retrieval (IR) component - which represents the documents in a searchable high-dimensional “semantic” space.

In common with Q/A systems, our concept-based search engine makes use of state-of-the-art deep learning techniques, but it bypasses expensive supervised learning, which requires a set of training data so large that it is unrealistic for most production environments. Instead, we use the two other types of machine learning - unsupervised and reinforcement learning. First, we train the search engine using unsupervised learning of the semantic content of a set of documents. Then, once the search engine is deployed, reinforcement learning is used to learn from users' interactions with it in order to improve over time.

Theory

Model
Our semantic search engine model can be broken down into two main components: a natural language processing (NLP) component that reads the user queries to extract their semantic meaning, and an information retrieval (IR) component that embeds the documents in a way such that they can be looked up by meaning.

Our NLP component is a deep neural network. In the first layer, we use pretrained word vectors to encode the meanings of individual words. Then, we use convolutional layers to add contextual meaning to each word by looking at its neighboring words. After several dense layers, we apply an attention layer that enables the model to focus on only the most important words in the query. The output is a high-dimensional representation of the semantic meaning of the query.
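
A minimal sketch of such an encoder, written with TensorFlow's Keras layers. The layer sizes, query length, and vocabulary size are illustrative assumptions, not our production configuration:

```python
import tensorflow as tf

EMBED_DIM = 300    # assumed size of the pretrained word vectors
QUERY_LEN = 32     # assumed maximum query length in tokens
VOCAB_SIZE = 50000 # assumed vocabulary size

def build_query_encoder():
    tokens = tf.keras.Input(shape=(QUERY_LEN,), dtype=tf.int32)

    # First layer: word vectors encode individual word meanings.
    # (In practice the embedding matrix would be initialized from
    # pretrained vectors rather than learned from scratch.)
    x = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)

    # Convolutional layers add context from neighboring words.
    x = tf.keras.layers.Conv1D(256, kernel_size=3, padding="same",
                               activation="relu")(x)
    x = tf.keras.layers.Conv1D(256, kernel_size=3, padding="same",
                               activation="relu")(x)

    # Several dense layers transform each position's representation.
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)

    # Attention: learn a weight per word, then take the weighted sum
    # so the model focuses on the most important words in the query.
    weights = tf.keras.layers.Dense(1)(x)              # (batch, len, 1)
    weights = tf.keras.layers.Softmax(axis=1)(weights) # normalize over words
    query_vector = tf.reduce_sum(weights * x, axis=1)  # (batch, 256)

    return tf.keras.Model(tokens, query_vector)
```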

In parallel, we also train the IR component, which assigns a trainable high-dimensional representation to each document in the corpus. For a given search query, the search results are the documents that are its nearest neighbors. In other words, the results are the documents whose semantic representations are closest to the semantic representation of the query.
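
The nearest-neighbor lookup itself is straightforward. A sketch using cosine similarity, which we assume here as one natural choice of "closeness":

```python
import numpy as np

def nearest_documents(query_vec, doc_embeddings, k=10):
    """Return the indices of the k documents closest to the query.

    doc_embeddings is an (n_docs, dim) matrix of trainable document
    representations; query_vec is the encoder output for one query.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    similarities = d @ q
    return np.argsort(-similarities)[:k]
```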

Training Process
We train this model by minimizing a loss function that rewards returning the correct document and penalizes returning an incorrect one. After training on many examples of search queries and documents, the model learns to return the correct documents for a given query.
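
One common way to express such a loss is a softmax cross-entropy over candidate documents; our exact formulation is not spelled out here, so treat this sketch as illustrative:

```python
import tensorflow as tf

def retrieval_loss(query_vecs, doc_embeddings, correct_doc_ids):
    """Cross-entropy loss that rewards scoring the correct document
    highest among all candidate documents.

    query_vecs:      (batch, dim) encoder outputs
    doc_embeddings:  (n_docs, dim) trainable document representations
    correct_doc_ids: (batch,) index of the relevant document per query
    """
    # Similarity of every query in the batch to every document.
    logits = tf.matmul(query_vecs, doc_embeddings, transpose_b=True)
    # Softmax cross-entropy pushes the correct document's score up
    # and every other document's score down.
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=correct_doc_ids, logits=logits
        )
    )
```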

Supervised learning - learning from examples of search queries and corresponding 'relevant' documents - would be the most obvious approach to training these models. However, producing a large number of example queries that cover all the documents in a potentially very large corpus would be intractable.

Instead, we train our model using an unsupervised learning approach. We randomly sample phrases from a document and treat those sampled phrases as search queries. This generates training data for every document in the corpus without requiring any hand-written sample queries. In fact, all that we need in order to create a semantic search index for a new customer is to crawl all of their documents, and the model learns to search the documents automatically.
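
A sketch of this pseudo-query sampling; the phrase lengths and the number of samples per document are illustrative assumptions:

```python
import random

def sample_pseudo_queries(document_tokens, n_queries=5, min_len=3, max_len=8):
    """Sample random phrases from a document to act as training queries.

    Each sampled phrase is paired with the document it came from,
    yielding (query, relevant_document) examples with no manual labeling.
    """
    queries = []
    for _ in range(n_queries):
        length = random.randint(min_len, max_len)
        start = random.randint(0, max(0, len(document_tokens) - length))
        queries.append(" ".join(document_tokens[start:start + length]))
    return queries

doc = ("our concept based search engine uses deep learning "
       "to go beyond keywords").split()
print(sample_pseudo_queries(doc))
```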


After the initial model is created, we can use reinforcement learning to learn from the users' interactions to improve the search engine's quality. Each time a user types in a query q and clicks on a resulting document d, we have a new data point that says that query q should return document d as its first result. Periodically, we collect all of the user data and use it to improve the model. We blend both the initial simulated data and the real user-generated data for this training stage.
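
In code, blending the two data sources can be as simple as the following sketch; the exact weighting we apply between simulated and real data is not specified here:

```python
def build_training_pairs(click_log, simulated_pairs):
    """Blend simulated pseudo-query data with real user click data.

    click_log: iterable of (query, clicked_document_id) events; each
    click is treated as evidence that the query should return that
    document as its first result.
    """
    pairs = list(simulated_pairs)
    for query, doc_id in click_log:
        pairs.append((query, doc_id))
    return pairs
```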


In addition to finding relevant documents, our search engine also finds the most relevant snippet from each document. We do this by applying our trained neural network to all sentences in the document, in order to find the one with the greatest semantic similarity to the search query.
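
A sketch of the snippet selection, assuming an encode function that applies the trained network to a piece of text and returns its semantic vector:

```python
import numpy as np

def best_snippet(query_vec, sentences, encode):
    """Pick the sentence most semantically similar to the query."""
    sentence_vecs = np.stack([encode(s) for s in sentences])
    q = query_vec / np.linalg.norm(query_vec)
    s = sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    return sentences[int(np.argmax(s @ q))]
```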

Implementation

We have implemented this concept-based search engine in TensorFlow, using a custom “estimator” in the Estimator framework, which makes it easy to train and serve models on Google Cloud.
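
For readers unfamiliar with the framework, a custom estimator is defined by a model_fn; the skeleton below shows the shape of such a function. Its internals are deliberately simplified stand-ins (a mean of word embeddings rather than our full encoder), so treat it as a sketch of the framework, not of our model:

```python
import tensorflow as tf

def model_fn(features, labels, mode, params):
    """Minimal custom model_fn for the tf.estimator framework."""
    embeddings = tf.compat.v1.get_variable(
        "word_embeddings", [params["vocab_size"], params["dim"]])
    doc_embeddings = tf.compat.v1.get_variable(
        "doc_embeddings", [params["n_docs"], params["dim"]])

    # Encode the query: look up word vectors and average them
    # (a placeholder for the real convolution-plus-attention encoder).
    word_vecs = tf.nn.embedding_lookup(embeddings, features["query_tokens"])
    query_vec = tf.reduce_mean(word_vecs, axis=1)

    # Score every document against the query.
    logits = tf.matmul(query_vec, doc_embeddings, transpose_b=True)

    if mode == tf.estimator.ModeKeys.PREDICT:
        ranked = tf.argsort(logits, direction="DESCENDING")
        return tf.estimator.EstimatorSpec(mode, predictions={"doc_ids": ranked})

    loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits))
    train_op = tf.compat.v1.train.AdamOptimizer().minimize(
        loss, global_step=tf.compat.v1.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
```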

When we set up a new customer, we crawl their webpages and train a new search engine model using unsupervised learning, as described above. We then deploy the trained search engine to Google Cloud for serving.

Once a semantic search engine has been trained and deployed for a given customer, it is ready to be queried. When a user submits a query to a server, the text is tokenized locally on that server and then sent to our search engine on Google Cloud. The engine returns a list of the ids of the documents in the search results, as well as the ids of the relevant snippets of text from each search result. The ids are then used to look up the corresponding documents and snippet texts stored in DynamoDB, and these search results are returned to the user.
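
The final lookup step might look like this sketch; the table name, key schema, and item fields are assumptions for illustration:

```python
import boto3

# Hypothetical table name and key schema.
table = boto3.resource("dynamodb").Table("search-documents")

def fetch_results(doc_ids, snippet_ids):
    """Resolve the ids returned by the model into documents and snippets."""
    results = []
    for doc_id, snippet_id in zip(doc_ids, snippet_ids):
        item = table.get_item(Key={"doc_id": doc_id})["Item"]
        results.append({
            "title": item["title"],          # assumed item fields
            "url": item["url"],
            "snippet": item["snippets"][snippet_id],
        })
    return results
```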

Finally, when the user clicks on one or more of the search results, that data is stored. Periodically, batches of data points from user interactions with the search engine are used to improve the search engine, so that the relevant documents come up higher in the list of search results in the future. This reinforcement learning procedure allows our search engine to improve over time.

Results

The semantic search model for information retrieval has proven to outperform traditional methods on all of our benchmarks. Based on numerous live experiments evaluating our model across hundreds of thousands of documents, we've seen that the method tied or outperformed the traditional keyword method 98% of the time and was over 5 times more likely to return a result that matched what the user was looking to retrieve.
