We recently launched our Social Reading feature here at RockMelt. When you opt in, Social Reading shares with your friends the URLs you visit from selected sites, and it allows you to see which articles your friends are reading. Inside a busy Feed window you can now see which of your friends have read which articles, which articles are most popular overall, among your friends, and trending. This can help you make decisions about which articles to read among a sea of articles.
We are big fans of Cassandra (http://cassandra.apache.org) here at RockMelt and we use it for almost all of our persistent data needs. So the natural place for us to store data associated with Social Reading was Cassandra. But how to store Social Reading data in Cassandra to ensure the best performance?
Using Cassandra’s Counters (http://wiki.apache.org/cassandra/Counters) we can easily keep track of how many users visited a URL. We can also use Cassandra’s column TTL support (http://www.datastax.com/dev/blog/whats-new-cassandra-07-expiring-columns) to “age out” old data without having to write scripts to cleanup the database for us.
You can think of data in Cassandra as being stored like a really big Excel spreadsheet. Imagine each row in the spreadsheet has a key. (Excel just gives them numbers like 1, 2, 3, …) Now each row can have as many columns as you want. Each column needs a unique name (in Excel columns are named with letters starting at A to Z and then AA to ZZ, etc). One thing different from a spreadsheet is not all rows have to have the same columns; you can make them arbitrarily different per row.
We considered two possible schemes for storing Social Reading data in Cassandra:
• A mapping from URLs to Users who read them
• A mapping from Users to the URLs they have read
It is not immediately obvious which is best.
Storing the data in either case is basically the same: In case one, the row key is the URL and the column name is the UserId. In case two, the row key is the UserId and the column name is the URL. Which one is best?
Let’s consider the requirements for querying the database. Our browser takes a group of URLs (representing a group of articles from an RSS feed) and sends them to our backed. The backend looks up your set of friends and asks the Social Reading component (we call it OGRE) which friends read which articles.
No matter which way you order the data, the query OGRE performs for the N URLs and M friends basically looks the same. For N URLS look up each row by URL and search through M UserID columns to find which of your M friends is in that row. Or, for M friends look up each row by UserId and search through N URLs to see which are in that row’s columns.
Either way we have NxM lookups to do. But in general the number of URLs is a smaller than the number of friends. (Our feeds each hold a maximum of 100 URLs, and our average user has over 450 friends.)
So the questions are:
• Does one method give better 90th-percentile (TP90) performance?
• Does one method have better worst case characteristics?
Let’s consider how the data actually looks in the database in both models and specifically what it looks like at the extremes. Turns out that (like many things on the Internet) the way people read articles follows a log-log pattern. This means a huge number of people read a small number of articles and then there is a large vertical drop off leading to a really really long tail of a huge number of URLs with only one or two people reading each one.
In our URL-to-UserId model we look at N URL keys and M UserID columns. In this model each row has a “non-capped” number of UserId columns.
Lets see how this does against the log-log pattern:
• TP90 performance: We would have to look up N keys and then look for M matches in each key. In most cases nobody or just a few people have read any URL and since N < M we would have to look at fewer keys in Cassandra. Seems pretty good.
• Worst Case: For a popular site we have to look at the same N keys. But each row would have many many columns. So that means it would take a long time to scan on disk. In addition since it is such a popular site it is likely that lots of people are looking at, and modifying, this really large dataset all the time. Yikes!
In our UserId-to-URL model we look at M UserID keys and N URL columns. In this model each row is limited to the number of articles a human can read in our data retention period.
Lets see how this does against the log-log pattern:
• TP90 performance. We look at more rows on average than the URL-to-UserId mapping (since N < M) but M is on average < 1000 so as long as we can get good enough performance out of Cassandra (and oh boy can we!) we should be good. Also, the average person just doesn’t read that much in a day. Even a heavy reader of 100 articles a day is only reading 3000 articles a month.
• Worst Case: In this case the worst case is a person with 5000 friends (the maximum Facebook allows) whose friends are all really heavy readers. This would be 15M records (over a month) to check for that person. Good news is this is only for that person. We have one user affected in this case. Unlike the URL-to-UserId case where lots of people would be affected by the very popular sites.
In the end we settled on a UserId-to-hash-of-URL mapping, since the TP90 performance meets our criteria and we don’t have to worry about popular sites.
Feel free to email engineers@rockmelt.com with any questions.
-
xbox360-live-points-online reblogged this from rockmeltengineering
-
how-to-mod-xbox-xbox360 reblogged this from rockmeltengineering
-
online-xbox360-free-electric reblogged this from rockmeltengineering
-
nintendo-wii-konsoll-xbox360 reblogged this from rockmeltengineering
-
xbox360-games-latest reblogged this from rockmeltengineering
-
dai reblogged this from rockmeltengineering
-
rockmeltengineering posted this
