One of the use cases for Redis is creating leaderboards.  Here are some Redis commands that I used in constructing a leaderboard data project.

This document assumes that you are a developer working with Redis for the first time and that you are used to working with SQL data sources.  The best source of documentation available is the Redis homepage (http://redis.io/), which lists and documents every Redis command (http://redis.io/commands).  You don't even really have to bookmark this page because a Google search on virtually any Redis-related topic will point you somewhere in this site.  Redis commands are pretty easy to pick up, so this document lists out some frequently used operations as they relate to the StudioCycling project.

Setup

If you're a .NET developer who is comfortable with Windows, the good news is that Microsoft has a port of Redis for Windows.  There are two ways to work with it.  The first is to download an installer for a recent build (https://github.com/rgl/redis/downloads) and the second is to get the most recent source and build from that (https://github.com/MSOpenTech/redis).  I used an old version.  It can still connect to our Redis instance, but there are a couple of commands that exist only in the newest versions of Redis that the installer version doesn't have.  Fortunately, we don't use any of them, so you can use the installer version with all the code and examples from this project.

It’s probably not a bad idea to put the folder with “redis-cli.exe” into your path so you can use it from the command line.

Frequently Used Redis Commands

List keys (applies to all data)

KEYS <pattern>

list all keys: KEYS *

list all leaderboard keys:  KEYS lb*

NOTE:  Don't run this in production.  KEYS is a blocking operation, so Redis can't do anything else while executing this command.

Get a Key (applies to Timeseries data)

GET <key>

so to get a key named “ts.1234567_55555.energy” use:  GET ts.1234567_55555.energy

Get a Hash (applies to Member data)

HGETALL <key>

so to get a key named “member.55555”, use:  HGETALL member.55555

if for any reason you only want to get one field, for instance age, use:  HGET member.55555 age

There is also a multi-get command (HMGET) that fetches several fields at once, but our data sets are so small in terms of the number of fields that HGET and HGETALL should cover almost all of your needs.
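If you would rather poke at this data from code instead of redis-cli, here is a minimal sketch using the redis-py client library.  The library and the localhost connection details are my assumptions; the key names are the examples from above.

import redis

# assumes a Redis instance on localhost and the redis-py package (pip install redis)
r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)

energy = r.get("ts.1234567_55555.energy")   # plain GET of a timeseries key
member = r.hgetall("member.55555")          # whole member hash as a Python dict
age = r.hget("member.55555", "age")         # a single field from the hash

print(energy, member, age)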

See Data in a sorted set (applies to Demographic and Leaderboard data)

ZREVRANGE <key> <start> <end> <options if any>

The <key> we know already.  <start> is the index of the first value to get and <end> is the index of the last.  Note that Redis accepts negative numbers and interprets them as offsets counting backwards from the end of the set (so 0 is the first element, 1 is the second, -1 is the last element, -2 is the second-to-last).  The only option we use regularly is "withscores", which returns the score as well as the member.

So to get the full leaderboard for club 113 by distance for Oct 2014 use:

ZREVRANGE lb.113.distance.2014.10 0 -1 withscores

Note that ZRANGE is the same thing but sorted in the other direction.  We use ZREVRANGE for leaderboards because we want the highest score to be rank 1.
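The same leaderboard query from code, again as a rough redis-py sketch (connection details assumed as before, key name taken from the example above):

import redis

r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)

# highest score first, returned as (member, score) tuples because of withscores
leaderboard = r.zrevrange("lb.113.distance.2014.10", 0, -1, withscores=True)
for rank, (member, score) in enumerate(leaderboard, start=1):
    print(rank, member, score)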

Filter a Leaderboard

ZINTERSTORE <output set name> <numkeys> <key1> <key2> … <key n> WEIGHTS <weight 1> <weight 2> … <weight n>

This one is pretty complicated and it's only worth doing a couple of times by hand to convince yourself that it works as expected.  The best way to understand it is to step through the .NET code and see how it works.

Here, <output set name> is the name of a brand-new redis zset that will be created to store the result of the intersection

<numkeys> is pretty much what it sounds like, the number of keys that you are going to intersect

<key 1> … <key n> are the names of all the keys

<weight 1> … <weight n> is a list of weightings to be applied to each set as the intersection is done.  Generally we set the leaderboard weight to 1 and all the rest to 0, which means that the final scores in the intersection set are equal to the scores from the leaderboard data set for each member.  Play around with this and see for yourself what it does if you have the time.

So to filter the leaderboard for club 113 by distance for Oct 2014 to show only male riders, use:

ZINTERSTORE tmp-john 2 lb.113.distance.2014.10 demo.male WEIGHTS 1 0

and then do ZREVRANGE tmp-john 0 -1 withscores  to actually view the filtered leaderboard
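For reference, here is the same filter as a redis-py sketch.  In redis-py you pass the keys and weights as a single dict, so the numkeys argument is handled for you.  Connection details are my assumptions; the key names are the ones from the example.

import redis

r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)

# weight the leaderboard 1 and the demographic set 0, exactly as in the CLI example
r.zinterstore("tmp-john", {"lb.113.distance.2014.10": 1, "demo.male": 0})

# view the filtered leaderboard, highest score first
for member, score in r.zrevrange("tmp-john", 0, -1, withscores=True):
    print(member, score)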

Delete a Key (applies to all data types)

DEL <key>

So to delete the “tmp-john” key that you created in the last example, use:  DEL tmp-john

and it’s gone.  There’s no “undo”.

Redis allows you to list keys by giving a pattern and returning all the keys that match it (KEYS ts*, for example, lists all keys that start with "ts").  You can delete all the keys with FLUSHALL and you can delete one with DEL, but using redis-cli alone you can't natively delete a set of keys by pattern.

But with Linux (or if you have the git extensions folder in your path on Windows), you can do this:

redis-cli -h REDIS_URL KEYS "ts*" | xargs redis-cli -h REDIS_URL DEL

which deletes all the keys that start with “ts” in the redis instance pointed to by REDIS_URL.

If you are a scrub like me, you checked your code into Git before you added a .gitignore and now your repo is full of things you don't want.  This is especially a problem when you are working with a compiled language like C#.

Here’s how to delete all the files from your repo that would have been ignored by your .gitignore:

git ls-files -i --exclude-from=.gitignore | xargs git rm --cached 

The 700 Level was the upper ring of the old Veteran’s Stadium, home of Philadelphia’s most rabid and loyal football fans.  Season tickets were exceptionally difficult to come by, with a waiting list sometimes decades in length and often passed down from one generation of fans to another.  In the late 90s, the website 700level.com went up and soon attracted a hardcore following among regulars, casual fans, and season ticket holders though a combination of creative interactive games and a discussion forum.  Over time, the community grew and waned with the fortunes of Philly sports, but always retained a core group of knowledgeable, passionate, and loyal fans even a decade after the actual 700 Level disappeared in a cloud of dust along with the stadium itself.

That’s not what this is about.

This is about the cutting-edge technology that powers the site, a combination of Angular.js, Bootstrap, Node.js, and Apache Solr, which is not the usual technology stack that drives websites dedicated to long-incinerated cheap seats.  The latest incarnation of the site is what could be considered a "Search Based Application", a software application in which a search engine is used as the core infrastructure for information access and reporting.

 

Why a Search Engine?

At its heart, all a discussion forum does is organize freeform text, which is an ideal use case for search engines.  The data model of a forum, while perfectly implementable with traditional relational databases (indeed, the first four versions of the site were built on top of SQL Server), benefits from denormalization and indexing.

Search engines natively work with highly denormalized data, where everything is thrown together into the same table (or "core" in Apache Solr terminology).  This reflects the simplicity of the problem space and doesn't require the developer to create and manage multiple levels of the forum hierarchy and keep them in sync.  There are really only two constructs of interest to a forum … users and posts.

Users work in fundamentally the same way as they do in a relational data store.  Solr happens to keep an in-memory copy of frequently used data in much the same way as a distributed key-value store does, so by creating a Solr core and purposing it as a user database, the application developer gets low-latency user authentication with the same level of security and repeatability that they are accustomed to.

Post data used to be partitioned across three database tables:  forums, threads, and replies.  While a relational database developer could easily denormalize this data, it's not the first instinct.  With a search engine as the data store, the forum and thread data is stored directly on the post itself as attributes.  This allows the search engine to facet the post documents by forum and thread.  Faceting is a powerful feature of search engines wherein the engine classifies its contents and allows them to be explored along multiple dimensions, in this case forum, thread, and user.  So querying the data store for a list of all threads in a particular forum is no more difficult than doing a wildcard query (very fast) and faceting the results by the thread id/key/name (also very fast because it's cached in memory).  The same technique, faceted along a different dimension, gives you all the posts by a certain user.
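To make that concrete, here is a rough sketch of such a facet query against Solr's standard HTTP select handler from Python.  The core name ("posts") and field names ("forum_id", "thread_id") are placeholders I made up for illustration; they are not the site's actual schema.

import json
import urllib.parse
import urllib.request

# hypothetical core and field names, standard Solr select handler
base_url = "http://localhost:8983/solr/posts/select"
params = {
    "q": "*:*",                 # wildcard query: match every post
    "fq": "forum_id:5",         # restrict to one forum
    "rows": "0",                # we only care about the facet counts
    "facet": "true",
    "facet.field": "thread_id", # bucket the matches by thread
    "wt": "json",
}

with urllib.request.urlopen(base_url + "?" + urllib.parse.urlencode(params)) as resp:
    data = json.loads(resp.read().decode("utf-8"))

# Solr returns facets as a flat [value, count, value, count, ...] list
flat = data["facet_counts"]["facet_fields"]["thread_id"]
threads = dict(zip(flat[0::2], flat[1::2]))
print(threads)   # {thread_id: post_count, ...}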

Search engines, because they index data when it’s loaded, also very efficiently do paging and sorting, a task that is troublesome at best and inefficient at worst to do with relational data stores.  Paging is a natural function of a search engine, and every query is effectively paged.
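Paging and sorting fall out of the same request pattern; only the parameters change.  Again, the field names below are illustrative placeholders, not the real schema.

import json
import urllib.parse
import urllib.request

# "start" and "rows" are Solr's built-in paging knobs
params = {
    "q": "*:*",
    "fq": "thread_id:12345",    # one thread's posts (hypothetical key)
    "sort": "post_date asc",    # oldest post first (hypothetical date field)
    "start": "40",              # skip the first two pages of 20...
    "rows": "20",               # ...and return the third page
    "wt": "json",
}

url = "http://localhost:8983/solr/posts/select?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url) as resp:
    page = json.loads(resp.read().decode("utf-8"))["response"]["docs"]
print(len(page))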

It goes without saying that if you intend to have a “Search” bar in your application that takes a user to a page of ranked search results, you’ll see huge benefits to using a search engine instead of a relational database.  The debugging tools available to developers will help you understand exactly why one search result appears before another so that you can tune the way you search your index to get the results your users are really after.

In the case of the 700Level, search was a long-requested feature.  The members of the site have spent 15 years creating a repository of highly specialized football knowledge that captures both general and point-in-time insights into the team, the players and coaches, and the NFL in general.  As the number of threads increased (approximately 6000 when we moved to the new forum software), it became increasingly difficult for users to find and reference previous discussion points.  Search was a logical response to that problem, and a search based application certainly delivers in that area.

Another challenge that forums and forum users need to address is redundancy.  It’s quite common for a forum to have a thread “fall off the main page” and then see a very similar thread spawn.  This leads to virtually inevitable confusion on the part of some users and at least one or two posts that contain no meaningful content except a note that a similar thread already exists.  Search helps address this by “cutting across” both forum and thread boundaries so that thematically similar threads and posts will naturally be grouped together during the search process by the engine itself.  In effect, search gives a small measure of semantic meaning to posts, an additional level of self-organization above and beyond the vestigial and rigidly explicit hierarchies of data objects (one forum contains many threads, one thread contains many posts, each thread belongs to exactly one forum, each post belongs to exactly one thread).

 

Why Node?

Node.js lends itself naturally to search based applications because it is an extremely lightweight, asynchronous server platform.  No matter how long queries take, node wastes no time or CPU cycles in a blocked state.  When it encounters a blocking operation (such as querying the user database or running a query against forum posts) it simply begins the operation and moves on to doing something else until the operation completes, at which time it executes the callback function attached to the operation.

As the modern web development paradigm shifts away from the classic request-response model and to the asynchronous API model, node.js and future asynchronous server technologies (like Scala, and soon .NET) should only increase in popularity.

It makes sense in a larger application to use separate data stores for different data needs (for example, we could use Solr to store post data and something like Redis or Mongo to store user data), a technique being called “polyglot persistence.”  Node’s asynchronous-by-default nature makes it an ideal technology to use as a layer of “glue” to bring together these separate data stores and present them to the world as a single, unified, and complete API.  When fulfilling an API request that requires data from multiple data stores, Node will just kick off the requests to the data stores and assemble the results as they return.  Certainly this requires a different style of programming and a new set of algorithms to tackle old problems, but that’s historically been the price to pay for progress.

Node itself is attractive because of its extremely lightweight footprint and its highly modular nature.  Node’s package manager and ecosystem of robust open-source modules allows a node server almost infinite flexibility.  “Middleware” like session management or real-time push functionality can be added and customized with a couple lines of code.  The fact that node is a new technology means it comes without legacy baggage and can take advantage of the exploration done by other technology ecosystems like Ruby or Python.

 

Where can this take us?

Up until now, we’ve discussed how to re-implement 20 year old message board functionality.  While interesting, it doesn’t answer the question “if it’s not broken, why fix it?”  There are a couple of possibilities that open up with the new technology stack.

The use of node.js suggests real-time push capabilities, which are easily added with the "socket.io" library and a couple lines of code.  One popular occurrence on the 700 Level is the "game thread", a thread created by one of the users as they are getting ready for the game that people post to while the game is going on.  While there is generally not much overly useful content generated during the game threads (due to the highly targeted nature of the thread and the fact that watching football sober is a rookie mistake), the game threads are a fun community-building activity that lets members share the game experience from around the world.  Turning the game thread into its own mini application, where new comments are pushed to users when they are created rather than pulled when a user happens to think of hitting refresh, could enhance that activity.

Certainly the unfiltered nature of the game threads could provide some interesting sentiment analysis, as well.  Supervised machine learning generally involves a human classifying a corpus of text and letting the machine attempt to divine what the patterns are in the data, and the positive/negative sentiment of a post is usually blisteringly obvious to another human so very few posts are wasted.  While the 700Level itself extracts little value from this (we’re a free website and we pay the costs from our own pocket and from occasional t-shirt sales and donations, generating no revenue from advertisers), the technology itself might be of use to advertisers, marketers, or any of a host of other industries that feed on customer understanding.

The search capability offers the opportunity to begin to divorce from the user-imposed organization of posts into forums and threads.  Now that we are allowing the search engine to index text, the occurrence of keywords and synonyms gives a certain level of self-organization to the corpus of posts.  Rather than digging through the archives for threads about Reggie White, for example, you can just search the site for “Reggie White” and get all the threads in which that player’s name appears regardless of when or in which thread the post was created.  You can see how this almost starts to create “meta threads,” that evolve and change over time with users’ interests and focus.

In time, it may be worthwhile to jettison the prior thread organization and allow users to search for topics of interest to them, turning the site into a highly personalized experience derived from their searches.  It’s all still the same content, but the grouping arises spontaneously from user interest and not from foreign key constraints.

Loading a whole directory of XML files into Solr comes down to piping the output of a directory listing into another Linux command.  For Windows users this may feel unfamiliar, but it's one of the things about the *nix operating system that is really powerful for developers.

We will build this up in pieces.  First, posting a single XML file to Solr is done like this:

curl http://localhost:8938/solr/CORE_NAME/update?commit=true -H "Content-Type: text/xml" --data-binary @file_name

We can list the directory of files normally with “ls” and pipe the results to our curl command using “xargs”

ls POST_DIRECTORY | xargs -I % curl http://localhost:8938/solr/CORE_NAME/update?commit=true -H "Content-Type: text/xml" --data-binary @POST_DIRECTORY/%

The trick here is the -I flag of xargs, which basically turns the input into a variable that you can place into the executed command wherever and however you want.  The end result is that every file in the "POST_DIRECTORY" subdirectory of where you execute the command gets posted to the "CORE_NAME" core.
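If you'd rather not lean on the shell, here is a rough Python equivalent of that pipeline.  It walks the directory and posts each XML file with the same Content-Type header and commit parameter; the port, core name, and directory name are just the placeholders from the curl example.

import os
import urllib.request

solr_update_url = "http://localhost:8938/solr/CORE_NAME/update?commit=true"
post_directory = "POST_DIRECTORY"

for file_name in os.listdir(post_directory):
    if not file_name.endswith(".xml"):
        continue
    with open(os.path.join(post_directory, file_name), "rb") as f:
        body = f.read()
    # same POST the curl command makes, one file at a time
    req = urllib.request.Request(
        solr_update_url,
        data=body,
        headers={"Content-Type": "text/xml"},
    )
    with urllib.request.urlopen(req) as resp:
        print(file_name, resp.status)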

 

This is the data access layer I used to use in my C# days, before Entity Framework came out and before I started working more with NoSQL data stores.  It's great at hitting relational data stores when you're willing to write the SQL for it.

 

In the process of prototyping a Solr-based application I found myself needing a quick-and-dirty database.  Rather than introduce another technology into the prototype, I tried to take advantage of the fact that Solr is effectively a fast key-value data store.  In my application I'm writing the middleware in Python using the solrpy module, but behind the scenes all it's doing is assembling the URL and handling the plumbing of sending the request and parsing the response.

Assume I’ve added fields called “username” and “admin_datatype” to schema.xml and my collection is called “admin”.

import solr
import uuid

class UserManager:
    def __init__(self):
        # everything lives in a single Solr core called "admin"
        self.data_collection = "admin"
        self.base_url = "http://localhost:8983/solr/" + self.data_collection

    def GetUserID(self, user_name):
        s = solr.SolrConnection(self.base_url)

        # filter queries narrow the wildcard match to this user name and to "User" rows
        filter_query_username = "username:" + user_name
        filter_query_datatype = "admin_datatype:User"

        response = s.query('*:*', fq=[filter_query_username, filter_query_datatype])

        # there should be zero or one hit; return "" if the user doesn't exist
        user_id = ""
        for hit in response.results:
            user_id = hit['id']

        return user_id

    def AddUser(self, user_name):
        s = solr.SolrConnection(self.base_url)

        # uuid4 provides the unique id field that the Solr schema requires
        s.add(id=str(uuid.uuid4()), username=user_name, admin_datatype="User")
        s.commit()
I use the "admin_datatype" field to simulate tables, so we are tagging each row of data with a value that we later filter on.  Solr can facet the data by the unique values in this field, which makes the queries even faster.  More complex data usually just means more fields in the add and select queries, and in the case of the code above, more parameters passed to the methods.  In application development it's generally useful for each object/row of data to have a unique identifier, and Solr requires this anyway, so you could easily use Solr to simulate the basic functionality of a document database like Mongo or even relational tables.

In general, this technique will keep you moving when doing R&D for search-based applications.  I’m not sure I’d want to build the whole app around it.
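For what it's worth, using the class looks like this ("jsmith" is just a made-up user name, and this assumes the UserManager class defined above is in scope):

# quick smoke test of the class above
um = UserManager()
um.AddUser("jsmith")
print(um.GetUserID("jsmith"))   # prints the generated UUID, or "" if the user isn't found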

 

 

 

I used this command to copy all the downloaded maven JARs into a deployment directory I was using to build a java project:

find ~/.m2 -name '*.jar' | xargs -I file cp file ~/deploy/workspace

Let's take a look at what it does.  The first part, find ~/.m2 -name '*.jar', is the basic use of the Linux "find" command.  We are telling it to look in the ~/.m2 directory (which is where Maven caches its dependencies) for all files that match *.jar.  (Quoting the pattern keeps the shell from expanding it before find gets to see it.)

Next we pipe “|” those results into a command called xargs. What xargs does is run a shell command once for each input line you send to it.  In our case, it will run once for each full file path that is returned from “find”, so you can see where we are going with this.  Xargs knows that you might want to run complicated commands, so it’s programmable.  By using the “-I” switch, we are basically declaring a variable, so “-I file” means “declare a variable called file that holds the input argument”.  Next comes the actual command:  “cp file ~/deploy/workspace”, which takes the “file” variable (which you remember holds the input parameter, which you hopefully still remember contains the full file path to one .jar file) and sends it to the “cp” command.

The net effect is this:

every .jar file in the local Maven cache gets copied to ~/deploy/workspace (or wherever you want the files to go)

I ran into an interesting problem today.  I was trying to increase the maximum number of open file handles allowed by the OS (a server application I was working on wanted lots of file handles available).  I tried:

$ sudo echo 2097152 > /proc/sys/fs/file-max

but got "permission denied".  The catch is that the output redirection is performed by your own (unprivileged) shell, not by the command running under sudo, so the write to /proc/sys/fs/file-max is what gets denied.  The trick is to run the whole thing, redirect included, inside a shell that is itself launched with sudo:

$ sudo sh -c "echo 2097152 > /proc/sys/fs/file-max"

and it worked just fine.

First off, let’s install some basic software that we’ll need.  You may have versions of these installed already, so you may be able to skip some of these steps.

$ sudo apt-get install g++
$ sudo apt-get install uuid-dev
$ sudo apt-get install git

INSTALL JAVA

Now we’ll need to install Java.  This is a painful process to do manually, but fortunately there is a way to do it with apt-get:

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
$ update-alternatives --display java

Next you’ll need to add JAVA_HOME to your /etc/environment file.  Add:

JAVA_HOME=/usr/lib/jvm/java-7-oracle

to the end of the file and save.

INSTALL MAVEN

We’re making progress, but still not done.  We need to install software called Maven to manage the build process.  Fortunately this can be managed with apt-get:

$ sudo apt-cache search maven
$ sudo apt-get install maven

I ran into a broken package when I tried this (at least, it didn't install properly on Ubuntu 13.04), so I had to fix the installation this way:

$ sudo dpkg -i --force-all /var/cache/apt/archives/libwagon2-java_2.2-3+nmu1_all.deb
$ sudo apt-get install maven

now, test the Maven install:

$ mvn --version

INSTALL STORM

Now, the moment we think we've been waiting for: the Storm installation.  We're going to do this in a very quick way.  The first step is to download the latest version from http://storm-project.net/downloads.html

Unzip that file, rename the directory to storm, and move it to your desired location.  Change to that directory and install 0MQ (a low-level socket interface used for distributed message passing):

$ ./bin/install_zmq.sh

You’ll probably want to add the storm/bin directory to your path.

TEST STORM WITH THE SAMPLE PROJECT

We’re going to need something called Leiningen to build this sample project.  You can install it by running:

$ wget -O ~/bin/lein https://raw.github.com/technomancy/leiningen/stable/bin/lein
$ chmod 755 ~/bin/lein

You’ll need to make sure that ~/bin is in your PATH.

Here I will assume you installed Storm into /opt/storm.  Create a directory for your Storm projects, change to it, and then clone and build the starter project:

$ git clone http://github.com/nathanmarz/storm-starter
$ cd storm-starter
$ lein deps
$ lein compile
$ lein jar

$ cd target
$ java -cp /opt/storm/lib/*:/opt/storm/storm-0.8.2.jar:storm-starter-0.0.1-SNAPSHOT.jar storm.starter.ExclamationTopology

That should run for a while.