Solr – Tech Stuff

The 700 Level was the upper ring of the old Veteran’s Stadium, home of Philadelphia’s most rabid and loyal football fans. Season tickets were exceptionally difficult to come by, with a waiting list sometimes decades in length and often passed down from one generation of fans to another. In the late 90s, the website 700level.com went up and soon attracted a hardcore following among regulars, casual fans, and season ticket holders though a combination of creative interactive games and a discussion forum. Over time, the community grew and waned with the fortunes of Philly sports, but always retained a core group of knowledgeable, passionate, and loyal fans even a decade after the actual 700 Level disappeared in a cloud of dust along with the stadium itself.

That’s not what this is about.

This is about the cutting edge technology that powers the site, a combination of Angular.js, Bootstrap, Node.js, and Apache Solr, which is not the usual technology stack that drives websites dedicated to long-incinerated cheap seats. The latest incarnation of the site is what could be consider a “Search Based Application”, a software application in which a search engine is used as the core infrastructure for information access and reporting.

Why a Search Engine?

At its heart, all a discussion form does is organize freeform text, which is an ideal use case for search engines. The data model of a forum, while very implementable with traditional relational databases (indeed, the first 4 version of the site were built on top of SQL Server), recognizes benefits from denormalization and indexing.

Search engines natively work with highly denormalized data, where data is thrown together into the same table (or “core” in Apache Solr terminology). This reflects the simplicity of the problem space and doesn’t require the developer to create and manage multiple levels of the forum hierarchy and keep them in synch. There are really only two constructs of interest to a forum … users and posts.

Users work in fundamentally the same way as they do in a relational data store. Solr happens to keep an in-memory copy of frequently used data in much the same was as a distributed key-value store does, so by creating a Solr core and purposing it as a user database, the application developer gets low latency user authentication with the same level of security and repeatability that they are accustomed to.

Post data used to be partitioned across three database tables: forums, threads, and replies. While a relational database developer could easily denormalize this data, it’s not the first instinct. With a search engine as the data store, the forum & thread data is stored directly with the post itself as an attribute. This allows the search engine to facet the post documents by forum and thread. Faceting is a powerful feature of search engines wherein the engine classifies its contents and allows them to be explored along multiple dimensions, in this case forum, thread, and user. So querying the data store for a list of all threads part of a particular forum is no more difficult than doing a wildcard query (very fast) and faceting the results by the thread id/key/name (also very fast because it’s cached in memory). The same technique but faceted along a different dimension could give you all the posts by a certain user.

Search engines, because they index data when it’s loaded, also very efficiently do paging and sorting, a task that is troublesome at best and inefficient at worst to do with relational data stores. Paging is a natural function of a search engine, and every query is effectively paged.

It goes without saying that if you intend to have a “Search” bar in your application that takes a user to a page of ranked search results, you’ll see huge benefits to using a search engine instead of a relational database. The debugging tools available to developers will help you understand exactly why one search result appears before another so that you can tune the way you search your index to get the results your users are really after.

In the case of the 700Level, search was a long-requested feature. The members of the site have spent 15 year creating a repository of highly specialized football knowledge that captures both general and point-in-time insights into the team, the players and coaches, and the NFL in general. As the number of threads increased (approximately 6000 when we moved to the new forum software), it became increasingly difficult for users to find and reference previous discussion points. Search was a logical response to that problem, and certainly a search based application delivers in that area.

Another challenge that forums and forum users need to address is redundancy. It’s quite common for a forum to have a thread “fall off the main page” and then see a very similar thread spawn. This leads to virtually inevitable confusion on the part of some users and at least one or two posts that contain no meaningful content except a note that a similar thread already exists. Search helps address this by “cutting across” both forum and thread boundaries so that thematically similar threads and posts will naturally be grouped together during the search process by the engine itself. In effect, search gives a small measure of semantic meaning to posts, an additional level of self-organization above and beyond the vestigial and rigidly explicit hierarchies of data objects (one forum contains many threads, one thread contains many posts, each thread belongs to exactly one forum, each post belongs to exactly one thread).

Why Node?

Node.js lends itself naturally to search based applications because it is an extremely lightweight, asynchronous server platform. No matter how long queries take, node wastes no time or CPU cycles in a blocked state. When it encounters a blocking operation (such as querying the user database or running a query against forum posts) it simply begins the operation and moves on to doing something else until the operation completes, at which time it executes the callback function attached to the operation.

As the modern web development paradigm shifts away from the classic request-response model and to the asynchronous API model, node.js and future asynchronous server technologies (like Scala, and soon .NET) should only increase in popularity.

It makes sense in a larger application to use separate data stores for different data needs (for example, we could use Solr to store post data and something like Redis or Mongo to store user data), a technique being called “polyglot persistence.” Node’s asynchronous-by-default nature makes it an ideal technology to use as a layer of “glue” to bring together these separate data stores and present them to the world as a single, unified, and complete API. When fulfilling an API request that requires data from multiple data stores, Node will just kick off the requests to the data stores and assemble the results as they return. Certainly this requires a different style of programming and a new set of algorithms to tackle old problems, but that’s historically been the price to pay for progress.

Node itself is attractive because of its extremely lightweight footprint and its highly modular nature. Node’s package manager and ecosystem of robust open-source modules allows a node server almost infinite flexibility. “Middleware” like session management or real-time push functionality can be added and customized with a couple lines of code. The fact that node is a new technology means it comes without legacy baggage and can take advantage of the exploration done by other technology ecosystems like Ruby or Python.

Where can this take us?

Up until now, we’ve discussed how to re-implement 20 year old message board functionality. While interesting, it doesn’t answer the question “if it’s not broken, why fix it?” There are a couple of possibilities that open up with the new technology stack.

The use of node.js suggests real-time push capabilities, which is easily added with the “socket.io” library and a couple lines of code. One popular occurrence on the 700 Level is the “game thread”, a thread created by one of the users as they are getting ready for the game that people post to while the game is going on. While there generally is not overly useful content generated during the game threads (due to the highly targeted nature of the thread and the fact that watching football sober is a rookie mistake), the game threads are a fun community building activity that allow members to share the game experience from around the world. Turning the game thread into its own mini application where new comments would be pushed to users when they are created rather than pulled when a user happens to think of hitting refresh could enhance that activity.

Certainly the unfiltered nature of the game threads could provide some interesting sentiment analysis, as well. Supervised machine learning generally involves a human classifying a corpus of text and letting the machine attempt to divine what the patterns are in the data, and the positive/negative sentiment of a post is usually blisteringly obvious to another human so very few posts are wasted. While the 700Level itself extracts little value from this (we’re a free website and we pay the costs from our own pocket and from occasional t-shirt sales and donations, generating no revenue from advertisers), the technology itself might be of use to advertisers, marketers, or any of a host of other industries that feed on customer understanding.

The search capability offers the opportunity to begin to divorce from the user-imposed organization of posts into forums and threads. Now that we are allowing the search engine to index text, the occurrence of keywords and synonyms gives a certain level of self-organization to the corpus of posts. Rather than digging through the archives for threads about Reggie White, for example, you can just search the site for “Reggie White” and get all the threads in which that player’s name appears regardless of when or in which thread the post was created. You can see how this almost starts to create “meta threads,” that evolve and change over time with users’ interests and focus.

In time, it may be worthwhile to jettison the prior thread organization and allow users to search for topics of interest to them, turning the site into a highly personalized experience derived from their searches. It’s all still the same content, but the grouping arises spontaneously from user interest and not from foreign key constraints.

Category Archives: Solr

700Level.com – a Search based application with Node.js

Why a Search Engine?

Why Node?

Where can this take us?

Using Solr as a basic key-value store