Bill Roth, Ulitzer Editor-at-Large


Related Topics: SEO Journal


Sophia's Contextual Discovery Platform: Indexing Technology

(ed. note: this is the first in a series of posts describing our technology.)

Text Analytics

Textual analytics is the process of software determining the content of a document so that human “consumers” can more readily find and/or work with that content. Textual analytics software takes many forms. Most people are familiar with the generic phrase search engine, for example, which is a very specific type of analytics software designed to allow users to find information within a larger document set.

Regardless of its form or function, any analytics software must ingest the textual portion of the content within a collection and understand it. Otherwise it provides no useful information to the user. This ingestion process is called indexing, and it is similar to what a human would have to do in order to understand the same textual content. A human would simply take much longer and would read and interpret the content with far more inconsistencies and errors.

Many different indexing approaches exist. Some of the first analytics tools used a simpler form of indexing (essentially, recording all the words used in all the texts in the set), but these tools only allowed for simple keyword (Boolean) searching. A keyword search lets a user specify a word or group of words (known as the query) for the software to compare against all the content in the dataset. If a text contains the single- or multiple-word query, it is a “hit” and is returned in a flat list of all texts matching the query. Ranking in that flat result list is usually based on the number of times the query appears in a text: the higher the number, the greater the ranking or implied relevance. This is, at its core, how basic keyword search engines work; web search engines such as Google layer further ranking signals, such as link analysis, on top of it.
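To make the keyword approach concrete, here is a minimal sketch of a word index with Boolean (AND) matching and ranking by occurrence count. It is an illustration only, not the code of any particular search product; the function names and the three sample texts are made up for this example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(texts):
    """Map each word to a Counter of {text_id: occurrences in that text}."""
    index = defaultdict(Counter)
    for text_id, text in enumerate(texts):
        for word in tokenize(text):
            index[word][text_id] += 1
    return index


def keyword_search(index, query):
    """Return (text_id, score) hits that contain every query word,
    ranked by how often the query words occur in the text."""
    words = tokenize(query)
    if not words:
        return []
    # Boolean AND: a text is a hit only if it contains all query words.
    hits = set(index.get(words[0], {}))
    for word in words[1:]:
        hits &= set(index.get(word, {}))
    scored = [(tid, sum(index[w][tid] for w in words)) for tid in hits]
    # Highest count first; ties broken by text id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample texts.
texts = [
    "The bank approved the loan and the bank opened a new branch.",
    "We had a picnic on the river bank.",
    "Interest rates rose again this quarter.",
]
index = build_index(texts)
results = keyword_search(index, "bank")
print(results)
```

Every text containing the query word comes back as a hit, ordered only by how often the word occurs. The index knows nothing about what the word means in each text.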

Keyword Indexing

Keyword indexing and querying served (and still serves) a useful purpose; however, as users’ needs became more sophisticated, analytics technologies had to evolve past keyword-based indexing in order to keep pace. The problem is that keyword-based analytics does not necessarily guarantee the true relevancy of the results. It only guarantees a match between the query and any resultant text. Just because a text contains the term “bank” does not mean that it is relevant to a user who is looking for information about financial institutions. The term “bank” has many meanings besides referring to a place where a person can cash a check. Therefore, more intelligent indexing software is required in order to provide more relevant information to the user.

Next-generation Analytics Tools

The trade-off is that these next-generation analytics tools—which provide a more sophisticated analysis of documents—require more time and hardware resources to perform the indexing process. Just how much more time or hardware depends on the particular proprietary indexing technology in question. These proprietary technologies can be based in linguistics, statistics, mathematics, or some combination of the three in order to determine the meaning within content. Regardless of approach, advanced analytics software is only useful if:

  1. It can accurately understand textual content, and ultimately its meaning and relevance, with little or no human intervention or upkeep.
  2. It can index content quickly enough to keep up with the explosive growth of data populations.
  3. It can be used to solve real data-centric problems.

Sophia Contextual Discovery Platform

Sophia’s research and development has resulted in the Sophia Contextual Discovery Platform (CDP), which is an advanced analytics technology designed to meet the above conditions using a novel approach to indexing text-based content. It leverages the science of semiotics in order to determine meaning within texts.

Semiotics is the study of signs and of how humans construct and understand meaning when communicating. As applied here, it focuses on words in order to determine how meaning is constructed and understood within a text. At the core of semiotics is the premise that a word must be analyzed in light of its context. Meaning is conveyed by words as they relate to other words within a localized context.

Furthermore, our model stresses not only the importance of intra-textual context (the words within a single text that combine to create meaning within that localized context) but also the effect of inter-textual context (that a text’s meaning is affected by other texts within the same dataset). This is called intertextuality and means that Sophia’s CDP can understand meaning and context at both a local (individual text) and global (dataset) level.

Intra-textual and Inter-textual Context

Take, for example, the term window. What does it mean? Alone, it has the potential to mean quite a few things. By seeing it within the larger context of other terms around it, its localized meaning becomes apparent:

“Close the popup window before continuing to format the document.”

Clearly, the term window in the above sentence means a framed box generated within a graphical user interface on a computer monitor. This meaning is fully conveyed through the other meaningful context words around it such as popup, format, and document. We know it does not refer to an opening in a building structure, as in the sentence below:

“Ensure that the ventilation slot is open on the window to allow fresh air into the room.”

We are able to distinguish this meaning of window due to the significant context words around it such as ventilation, air, and room.
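The idea of reading a word through its neighbors can be sketched in a few lines. The snippet below simply collects the content words that surround window in each of the two sentences above. It is a rough illustration of localized context, not a description of how CDP works; the stopword list, the `radius` parameter, and the function names are assumptions made for this example.

```python
import re

# Hypothetical stopword list, only here to keep the output readable.
STOPWORDS = {"the", "to", "is", "on", "that", "into", "before", "a", "of", "in"}


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def context_words(text, target, radius=8):
    """Return the non-stopword tokens found within `radius` positions
    of each occurrence of `target` in the text."""
    tokens = tokenize(text)
    found = set()
    for i, tok in enumerate(tokens):
        if tok == target:
            start = max(0, i - radius)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + radius]
            found.update(w for w in neighbours if w not in STOPWORDS)
    return found


gui = "Close the popup window before continuing to format the document."
building = ("Ensure that the ventilation slot is open on the window "
            "to allow fresh air into the room.")

gui_ctx = context_words(gui, "window")
building_ctx = context_words(building, "window")

print(sorted(gui_ctx))
print(sorted(building_ctx))
print(gui_ctx.isdisjoint(building_ctx))
```

The two sets of neighbors share no words at all, which is the signal a reader uses, without thinking about it, to tell the two meanings of window apart.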

Over time, though, a community may adopt a new understanding of a word that is not imparted by the localized context:

“Our window is closing rapidly.”

Here the word has little localized context to convey its meaning, but a broader analysis of the dataset and its inter-textual communications can reveal it. Knowing that the sentence is one transmission in a larger exchange between a pilot and ground control during a thunderstorm lets us read window as a window of opportunity in which to act. The inter-textual context of a dataset is therefore also important in semiotics-based textual analysis.
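A simple way to picture dataset-level context is to let a short text borrow context words from the other texts around it. The sketch below does only that, using a made-up three-line transcript. The transcript, the stopword list, and the function names are all hypothetical, and this is a stand-in for the idea rather than CDP's actual analysis.

```python
import re
from collections import Counter

# Hypothetical stopword list for this example.
STOPWORDS = {"the", "is", "for", "our", "a", "to", "before"}


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def dataset_context(dataset, text):
    """Count the content words that appear in every other text of the
    dataset, skipping stopwords and the words of `text` itself."""
    own_words = set(tokenize(text))
    counts = Counter()
    for other in dataset:
        if other == text:
            continue
        counts.update(
            w for w in tokenize(other)
            if w not in STOPWORDS and w not in own_words
        )
    return counts


# Hypothetical transcript; only the middle line comes from the article.
transcript = [
    "Tower, the storm cell is moving toward the runway.",
    "Our window is closing rapidly.",
    "Request immediate clearance for landing before the storm arrives.",
]

ctx = dataset_context(transcript, "Our window is closing rapidly.")
print(ctx.most_common(3))
```

On its own, the sentence offers only closing and rapidly as clues. The surrounding texts add storm, runway, clearance, and landing, which point toward a window of opportunity rather than a pane of glass.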

The indexing process involves in-depth analysis (both intra-textual and inter-textual) of the dataset to determine semantic relationships and larger thematic elements within it. Each word is examined and analyzed in its context to reveal its semantic meaning. That information is then used in defining larger conceptual themes. Although all words have some sort of context, not all of them are interesting or useful from a thematic perspective.

Contextually Relevant Information

CDP uses a patented algorithmic approach to differentiate between contexts that are interesting and those that are not. This process enables CDP to provide contextually relevant information when a user requests it, regardless of the specific words the user employs to frame that request.
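The patented algorithm itself is not described in this post, so the snippet below is explicitly not it. As a familiar stand-in for the general idea of separating informative words from uninformative ones, it scores each word by inverse document frequency (IDF): a word that appears in every text scores zero, and a word confined to a single text scores highest. The sample texts and function names are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def idf_scores(texts):
    """Score each word as log(N / df), where N is the number of texts
    and df is the number of texts that contain the word."""
    n = len(texts)
    doc_freq = Counter()
    for text in texts:
        # Count each word once per text, however often it occurs there.
        doc_freq.update(set(tokenize(text)))
    return {word: math.log(n / df) for word, df in doc_freq.items()}


# Hypothetical sample texts.
texts = [
    "Close the popup window before continuing to format the document.",
    "Ensure that the ventilation slot is open on the window "
    "to allow fresh air into the room.",
    "The pilot asked the tower for clearance.",
]

scores = idf_scores(texts)
for word in ("the", "window", "popup"):
    print(word, round(scores[word], 3))
```

In this tiny dataset, the scores nothing because it appears everywhere, window scores a little because it appears in two of the three texts, and popup scores highest because it is specific to one. A production system would need far more than this, but the contrast shows why not every word's context is equally useful.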

Next-Generation Textual Analytics

While CDP employs statistical and mathematical algorithms to analyze intra-textual and inter-textual contexts, it does not leverage static dictionaries, thesauri, or other linguistic references that must be updated as language changes. CDP derives meaning out of the content itself and never needs language updates or other modes of upkeep. This drastically reduces maintenance costs while still providing the benefits of next-generation text analytics.

And that’s the real value of CDP—providing textual analytics that derives meaning from words and documents in the same way humans do. Because semiotics as a science explores how humans combine words to convey meaning within context, the semiotics-driven indexing technology within CDP approaches text analytics in a more natural, human-like manner. The result is contextually relevant information that makes sense to users, helping them more naturally solve real problems within their organization.


More Stories By Bill Roth

Bill Roth is a Silicon Valley veteran with over 20 years in the industry. He has played numerous product marketing, product management and engineering roles at companies like BEA, Sun, Morgan Stanley, and EBay Enterprise. He was recently named one of the World's 30 Most Influential Cloud Bloggers.