Patterns in Full-Text Search (2 of 2)

Posted by Cuneyt Uysal, Manager of Professional Services, RedDot Solutions

New York - Tuesday, November 21, 2006 11:15:00 AM - In the second part of this series, I'll focus on integrating security and extending our example to handle dynamic associations.

Security


In our bookstore, we sell monthly subscriptions to a wide array of eBooks, such that the customer can purchase access to all the books of a particular category, say “Gardening.” It is important that the customer can only search for text within the zones of the eBooks to which she has purchased a subscription. Any customer, however, can search within the abstracts of the Book Listings in every category in the store.

The key to implementing this security model is a clear schema that accounts for this relationship between content (documents) and user profiles. We use the term content to denote the more general concept of digital assets, whether HTML pages, Adobe PDF documents or QuickTime movies.

So now, we can update our queries to include concepts of user management and security models.

( Abstract <CONTAINS> “hibiscus” ) <OR>
( <SOUNDEX> “hibiscus” <IN> Full-text ) <AND>
( Category <MATCHES> [#USER:Subscription Category#] )

Notice how the query contains a new syntax element, such that an attribute from the current user’s profile is dynamically integrated into the query. The generated query could actually look like:
 

( Abstract <CONTAINS> “hibiscus” ) <OR>
( <SOUNDEX> “hibiscus” <IN> Full-text ) <AND>
( Category <MATCHES> “Gardening” )


This generated query satisfies our security model: provided <AND> binds more tightly than <OR>, the category restriction applies only to the full-text clause, while abstracts remain searchable across all categories. We could take things a step further and restrict access at a lower level by implementing what is known as a content constraint. This is handled at the web server level, not at the query level, and restricts all access to certain pieces of content, whether they are reached through search results or requested directly from the web server by name, such as through a link in a URL. This federated security model is key to establishing a secure web presence and protecting valuable digital assets. Most modern integrations with full-text search engines adhere to this philosophy. Otherwise, you can expose inconsistencies in security that could be exploited in a malicious attack.
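The profile substitution described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the search engine or LiveServer implements it: the function name, the regular expression and the quote handling are assumptions, and only the placeholder notation is taken from the query shown earlier.

```python
import re

# Placeholder notation copied from the article's query: [#USER:Attribute Name#]
PLACEHOLDER = re.compile(r"\[#USER:([^#\]]+)#\]")

def expand_user_placeholders(template, profile):
    """Replace each [#USER:...#] placeholder with the quoted profile value."""
    def substitute(match):
        attribute = match.group(1).strip()
        if attribute not in profile:
            # Fail closed: a missing attribute must never widen the search.
            raise KeyError(f"user profile has no attribute {attribute!r}")
        # Drop embedded quotes so the value cannot break out of the literal.
        value = profile[attribute].replace('"', "")
        return f'"{value}"'
    return PLACEHOLDER.sub(substitute, template)

template = (
    '( Abstract <CONTAINS> "hibiscus" ) <OR> '
    '( <SOUNDEX> "hibiscus" <IN> Full-text ) <AND> '
    '( Category <MATCHES> [#USER:Subscription Category#] )'
)
query = expand_user_placeholders(template, {"Subscription Category": "Gardening"})
print(query)
```

Raising an error when the attribute is missing is a deliberate choice in this sketch: silently dropping the clause would turn a restricted search into an unrestricted one.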

 

Association

We see now that the prior example not only allowed us to implement the security model, but also provided a relationship between a class of user, Customer, and all classes of content that have a matching Category. This is analogous to the principle of association in Object-Oriented Design (OOD), which offers a more robust way of perceiving your content and user classes. We will focus on the content-centric perspective in this document.

Associations exist not only between users and content, but also from one content class to another. We could have a web page that describes a store in a particular language, say the Spanish eBook Store, along with the book titles available at that store, since these can vary from locale to locale. We could create the association at either level, the book or the store, whichever is more appropriate. Certain forms of association are so common that they have emerged as patterns. The Wrapper pattern is a good example.

 

Wrapper Pattern and Composition

We often find that a document’s metadata cannot be administered simply by modifying the document directly. For example, we can insert metadata into the HEAD portion of an HTML document, but we cannot do the same for a QuickTime movie, as the media does not support this.

To overcome this, there is typically a descriptor file that contains one or more references to external documents. It serves as a wrapper for the valuable document and exists purely to adapt the piece of content for metadata capture when it is imported into the full-text search engine repository. Alternatively, database-driven systems can have specific tables for storing metadata, although this is less generic than simply using a corresponding XML file for wrapping.
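As a rough illustration of the wrapper idea, the Python sketch below reads a small XML descriptor that carries metadata for a movie file and points at it by reference. The descriptor layout, the element and attribute names, the sample values and the read_descriptor helper are all hypothetical; a real import format would be defined by the search engine or content server in use.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptor: metadata fields plus a reference to the wrapped file.
DESCRIPTOR = """
<wrapper>
  <metadata>
    <field name="Title">Pruning Hibiscus</field>
    <field name="Category">Gardening</field>
  </metadata>
  <reference href="media/pruning-hibiscus.mov"/>
</wrapper>
"""

def read_descriptor(xml_text):
    """Return the metadata fields and referenced documents of one wrapper."""
    root = ET.fromstring(xml_text.strip())
    fields = {
        f.get("name"): (f.text or "").strip()
        for f in root.iterfind("metadata/field")
    }
    references = [r.get("href") for r in root.iterfind("reference")]
    return {"fields": fields, "references": references}

record = read_descriptor(DESCRIPTOR)
print(record)
```

The movie itself is never opened; everything the index needs about it comes from the wrapper.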

Additionally, there are cases where HTML pages have metadata assigned to them and also contain references to other content items, such as links to Adobe PDFs. This is an example of content aggregation, or composition. The power of this pattern lies in the ability to assign metadata in a single, uniform location, the HTML page as a whole, and have it inherited by all the child content items referenced from the base document. This eases metadata administration for content authors.
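The composition pattern can be pictured with a small Python model in which each child inherits the metadata of the page that links to it. The ContentItem class, the file names and the rule that a child’s own value overrides an inherited one are assumptions made for this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class ContentItem:
    """A piece of content plus the child items it links to (hypothetical model)."""
    name: str
    metadata: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def effective_metadata(item, inherited=None):
    """Yield (name, metadata) pairs in which children inherit the parent's fields.

    Assumption: a value set on the child itself wins over an inherited one.
    """
    merged = {**(inherited or {}), **item.metadata}
    yield item.name, merged
    for child in item.children:
        yield from effective_metadata(child, merged)

page = ContentItem(
    "gardening.html",
    {"Category": "Gardening", "Author": "John Paul Jones"},
    [
        ContentItem("hibiscus-care.pdf"),
        ContentItem("roses.pdf", {"Author": "Jane Doe"}),
    ],
)
index = dict(effective_metadata(page))
print(index)
```

The author sets Category once on the HTML page, and both PDFs are indexed with it without being edited individually.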

 

Inheritance

Although we discussed inheriting attributes from a base document, this is not related to the concept of content class inheritance. Inheritance refers to the OOD tenet of a class hierarchy, in which a descendant of a class maintains all of the abilities and fields of its ancestor. This ensures backward compatibility when developing newer classes and avoids duplicating work in the process.

In the context of full-text search engines, we can follow the principle of inheritance by ensuring that similar content classes that share related fields are grouped accordingly. In our example, we have the two content classes Book Listing and eBook in PDF Format:

Diagram 3:

Book Listing
- Title
- ISBN
- Category
- Author
- Abstract

eBook in PDF format
- Title
- ISBN
- Category
- Author

We can see the overlap in these two classes, and we can exploit it by placing them into related content group hierarchies, such as:

  • book-html-content -> Book Listing content
  • book-blob-content -> eBook in PDF format content

 

The content group hierarchy is actually designed much in the same way that a classical OOD class hierarchy is. The generic base class is book, and the subclasses are the HTML version (book-html-content), and the BLOB (book-blob-content) version. As a result, we can construct elegant queries that harness the idea of inheritance.

For example, to simply search for a keyword in any of the book classes we can use the following:

“hibiscus” <AND> ( GROUP <.SUBSTRING.> “book” )

This query will match both the book-html-content and book-blob-content groups.

More important is the ability to implement search queries that are abstracted from the actual content type. That is, the developer should be able to make elegant parametric queries based on metadata, regardless of the type of media considered. For example:

( Author <CONTAINS> “John Paul Jones” ) <AND>
( Category <MATCHES> “Gardening” ) <AND>
( GROUP <.SUBSTRING.> “book” )


 

The above is a valid query for both the Book Listing and the eBook class, since they both offer the Author and Category fields. Results from both content classes would be valid, yet the application developer did not have to enumerate a list of “valid” classes. The developer could add a new content class, such as book-xml-content, and as long as it inherited the same fields, the query would work without any modification.
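To make the point concrete, the following Python sketch evaluates the same three conditions against a small in-memory list standing in for the index. It is not the search engine’s evaluation logic: the search_books function and the sample records are hypothetical, CONTAINS is approximated by a case-insensitive substring test and MATCHES by exact equality.

```python
# In-memory stand-in for an index; group names follow the article's example.
DOCUMENTS = [
    {"GROUP": "book-html-content", "Author": "John Paul Jones", "Category": "Gardening"},
    {"GROUP": "book-blob-content", "Author": "John Paul Jones", "Category": "Gardening"},
    {"GROUP": "book-xml-content", "Author": "John Paul Jones", "Category": "Gardening"},
    {"GROUP": "book-blob-content", "Author": "John Paul Jones", "Category": "Cooking"},
    {"GROUP": "store-html-content", "Category": "Gardening"},
]

def search_books(documents, author_contains, category, group_substring="book"):
    """Return documents that satisfy all three conditions of the query above."""
    results = []
    for doc in documents:
        if group_substring not in doc.get("GROUP", ""):
            continue  # not part of the book hierarchy
        if author_contains.lower() not in doc.get("Author", "").lower():
            continue  # Author condition fails (or the field is absent)
        if doc.get("Category") != category:
            continue  # Category condition fails
        results.append(doc)
    return results

hits = search_books(DOCUMENTS, "John Paul Jones", "Gardening")
print([doc["GROUP"] for doc in hits])
```

The book-xml-content record is returned even though the function knows nothing about that class. The same convenience carries a caveat: substring matching relies on naming discipline, since any group whose name happens to contain “book” would also match.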

 

Dynamic Associations: Targeting Related Content

The crux of a metadata-driven full-text search solution is the ability to drive relevant content to the right consumers. Whether used for commercial targeting or just internal business communication, delivering meaningful content is only effective when it takes the current user’s profile into account. That is, by dynamically creating lists of links to relevant content in real time, based on the current user’s desires and preferences, the user experience becomes much more efficient, compelling and fulfilling.

This type of targeting, also referred to as implicit and explicit personalization, can be achieved without a full-text search engine, but doing so would require extensive custom programming to generate and manage profiles and to integrate them with an effective mechanism for querying content. Managing the entire content lifecycle, from authoring through management and publication to final consumption, is a task best suited to a fully integrated content suite.

To reinforce, the aforementioned features of modern systems allow the application developer to quickly create extensible and, more importantly, flexible search queries that can be adapted in short development cycles in line with changing marketing and business needs. The greatest pitfall of homegrown solutions is that they are often designed on the assumption that such rules change rarely, if ever. As the market changes, so will the business rules and logic behind your targeting strategy. By using the more natural language of full-text search engines, it is much simpler to adjust these rules in the face of new customer trends. In fact, it is possible to configure these queries in a content-managed environment such that non-technical authors can update the business logic, placing control of the targeting in the hands of the marketing and communications staff who know the business landscape best.

There are several potential candidates for targeting use cases. Below are some samples based on our bookstore example.

  • Get all books that a customer has / has not yet bought
  • Get all books from the customer’s subscription category
  • Get all eBooks from the customer’s favorite authors

The list grows with the amount of metadata and profile information captured. The more robust and relevant your taxonomy is, the more you can leverage these tools. Ultimately, the key is to adopt a content- and object-centric view, as opposed to a data-centric view, when addressing targeting. A data-centric view is limited in that it focuses on the data itself, without regard to how the data is used or how it acts. These actions are key to an object-oriented approach, which has proven itself well suited to modern application development.
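Two of the targeting use cases listed above can be sketched as small query builders in Python. The sketch uses only the operators that appear in this article; the helper names, the “Favorite Authors” profile attribute and the use of “book-blob” to select eBooks are assumptions. The bought / not yet bought case is left out because it would need a purchase-history field and a negation operator that the article does not show.

```python
def quoted(value):
    """Wrap a value in double quotes, dropping any embedded quotes."""
    return '"' + value.replace('"', "") + '"'

def clause(field_name, operator, value):
    """Build one parenthesized condition such as ( Author <CONTAINS> "..." )."""
    return f"( {field_name} <{operator}> {quoted(value)} )"

def subscription_query(profile):
    """All books from the customer's subscription category."""
    return " <AND> ".join([
        clause("Category", "MATCHES", profile["Subscription Category"]),
        clause("GROUP", ".SUBSTRING.", "book"),
    ])

def favorite_authors_query(profile):
    """All eBooks by any of the customer's favorite authors."""
    authors = profile["Favorite Authors"]  # hypothetical profile attribute
    if not authors:
        raise ValueError("profile lists no favorite authors")
    author_part = " <OR> ".join(clause("Author", "CONTAINS", a) for a in authors)
    return f"( {author_part} ) <AND> " + clause("GROUP", ".SUBSTRING.", "book-blob")

profile = {
    "Subscription Category": "Gardening",
    "Favorite Authors": ["John Paul Jones"],
}
print(subscription_query(profile))
print(favorite_authors_query(profile))
```

Because each rule is just a query template plus profile attributes, changing the targeting strategy means editing a template rather than rewriting application code.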

 

Conclusion

The techniques, concepts and underlying practices on which this article is based are derived from subject matter expertise developed by leveraging the Verity K2 Search Engine, bundled with RedDot LiveServer. To learn more about this exciting technology, please visit the RedDot website.

Verity K2 search technology provides access to unstructured text, documents, presentations, Adobe PDF files and virtually any other file residing online. Used together with RedDot LiveServer Personalization Manager, this highly efficient search engine can even refine search results based on a user’s personalization profile.
