By Sankalp Sharma, CTO, Sportskeeda
I distinctly remember the joy of seeing blue bars appear in Google Analytics’s “Right now” box when I first installed it on my little blog thingy. The bars trickled in as soon as someone landed on my blog; it was like magic.
Young and wild as I was, and largely jobless, having just finished my 2nd-semester exams back in undergrad, I immediately started picturing how I would build this if I had to.
At first it seemed simple. So I started to jot this down in my “Ideas” notebook.
Doodling architecture diagrams in my head, going from a single website to multi-tenancy to scaling for traffic, my mind did a couple of double-takes.
At the time, all I was equipped with was a tiny bit of PHP and a little JavaScript, the kind you’d find in StackOverflow’s second-to-top answers.
Hi there, Sankalp this side. You’ll hear more from me in italics. And this is the story of why and how we built an in-house real-time web analytics system at Sportskeeda.
Back in college, little did I know that a similar system would one day become a full-fledged project I’d be working on. I just had to wait 10 long years before I got to build it with a team of 3 passionate engineers.
Fast forward about a decade, circa 2018. We at Sportskeeda were trying to build a niche “ranking algorithm” for our feeds, and very quickly realized that one of the core ranking features had to be the “live traffic” on a particular article.
So, for instance, if an article was 10 minutes old versus another that was 10 hours old, the 10-minute-old article should most likely show up first on the homepage feed, unless that older article had 10,000+ people actively reading it, sharing it, and commenting on it while the 10-minute-old article had just 150 readers.
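To make that trade-off concrete, here is a minimal sketch of how freshness and live readers might be combined into one score. The half-life, weights, and field names here are illustrative assumptions, not our production formula:

```javascript
// Illustrative only: a toy freshness-vs-live-readers score.
// The half-life, weights, and field names are hypothetical, not the production formula.
function liveScore(article, now = Date.now()) {
  const ageMinutes = (now - article.publishedAt) / 60000;

  // Freshness decays as the article ages (a ~60-minute half-life here).
  const freshness = Math.pow(0.5, ageMinutes / 60);

  // Live readers contribute logarithmically, so a huge live audience can
  // outweigh the freshness advantage of a much newer article.
  const liveBoost = Math.log10(1 + article.liveReaders);

  return freshness + 0.5 * liveBoost;
}

// 10-minute-old article with 150 readers vs a 10-hour-old one with 10,000 readers:
console.log(liveScore({ publishedAt: Date.now() - 10 * 60000, liveReaders: 150 }));    // ≈ 1.98
console.log(liveScore({ publishedAt: Date.now() - 600 * 60000, liveReaders: 10000 })); // ≈ 2.00, edges ahead
```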
When you see these loading tiles on our homepage (usually for < 100 milliseconds), our algorithms are sweating away to give you the best possible feed based on what you probably like
So when you access the homepage, our algorithms have to determine what to show you there based on your preferences. Based on what other folks like you are reading at the moment, what they have read in the past, and how closely that matches your “affinity”, we recommend what you should probably read next.
Now, this is an ever-evolving system, and for the foreseeable future it will be far from perfect.
But this story is about one small part of that system – the live traffic score system, or as we internally call it, Ska (it stands for SK Analytics and is pronounced s-k-a-a. Just kidding, we just call it S.K.A. DM for better name suggestions). This is the story of what led to the decision to build it, and how we eventually built it.
It also involves us exploring the idea of open sourcing this piece.
Let’s start from the beginning.
Part 1 — Build vs Plug In
Our first obvious thought was the Google Analytics (hereafter also referred to as GA) “realtime traffic” API. We spent a couple of hours hammering together a small spec, then a small prototype that plugged into GA’s API and fetched the top-ranking “URLs”. We cleaned up the UTM parameters and other unnecessary query strings, hit our datastores to get the post ID for each “canonicalized” URL, and stored the counts as Redis key-value pairs.
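The whole prototype was conceptually about this much code. Here is a minimal sketch, assuming node-redis and a post-ID lookup of our own; the helper names and key layout are invented for illustration:

```javascript
const { createClient } = require('redis');

// Strip UTM parameters and other tracking noise so that
// /article/foo?utm_source=twitter and /article/foo count as the same page.
function canonicalize(rawUrl) {
  const url = new URL(rawUrl, 'https://www.sportskeeda.com'); // base handles relative paths
  for (const key of [...url.searchParams.keys()]) {
    if (key.startsWith('utm_') || key === 'fbclid' || key === 'gclid') {
      url.searchParams.delete(key);
    }
  }
  return url.origin + url.pathname + url.search;
}

// Hypothetical shape: GA's realtime API hands us (url, activeUsers) pairs;
// we resolve each canonical URL to a post ID and stash the count in Redis.
async function storeRealtimeCounts(rows, lookupPostId) {
  const redis = createClient();
  await redis.connect();
  for (const { url, activeUsers } of rows) {
    const postId = await lookupPostId(canonicalize(url)); // hits our own datastore
    if (postId) {
      await redis.set(`ska:live:${postId}`, String(activeUsers));
    }
  }
  await redis.quit();
}
```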
Like a neat distributed system, the “Ranking Algorithm” considered all the articles in the “consideration set”, looked up their read counts in Redis, cached them in its own data stores for a few seconds, and used that cached copy to compute the ranking.
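The read path on the ranking side was just as small, roughly the following; the five-second window here is only a stand-in for the “few seconds” mentioned above:

```javascript
// Sketch of the ranking side: batch-read live counts from Redis and keep a
// short-lived in-process copy so the ranking loop never hits Redis directly.
const CACHE_TTL_MS = 5000; // "a few seconds" -- illustrative value
let cache = { at: 0, counts: new Map() };

async function liveCountsFor(redis, postIds) {
  if (Date.now() - cache.at < CACHE_TTL_MS) return cache.counts;

  const values = await redis.mGet(postIds.map((id) => `ska:live:${id}`));
  const counts = new Map();
  postIds.forEach((id, i) => counts.set(id, Number(values[i]) || 0));

  cache = { at: Date.now(), counts };
  return counts;
}
```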
Within a day, a demo was ready, but (during the demo to its intended primary users, funnily enough) we realized that not every URL matching an “article” pattern was actually an article. Half the “articles” we were matching and trying to score were not articles at all, but other pages, usually some kind of taxonomy page.
On top of that, when one of the developers opened an article to demo how the “live user” count would immediately go up by one, it didn’t. The developer had (gasp, gasp) an ad blocker installed. A lot of ad blockers block Google Analytics. Fair. Our data was always going to be a sample set for the most part, but with this state of things, it would be hopelessly skewed.
What follows is a dramatic representation of what went down on a fine Thursday afternoon. It happened so long ago that at this point it is just a reconstruction from memory. I distinctly remember it was a Thursday because just the previous night we’d had our “mid-week team party” and I was chugging extra caffeine that day.
While we were writing the spec to progressively filter the data, we took a chai break. Over chai, one of us chimed in: “It’s almost as if we need a separate GA property, dedicated only to articles!”
B: “Could we even do that, if, hypothetically, we needed to?”
C: “That would interfere with the master reporting, which all the teams rely on.”
A: “Not if we could logically isolate it and run 2 GA properties on the same page?”
B: “Is that even possible?”
With the last gulp of chai, one of us said “What if we build a small GA of our own, which only tracks realtime traffic?”
C: “Sure, it doesn’t even have to be persistent”
A: “We could use a Redis datastore with finite expiry, even.”
B: “But we will need to build CSRF protection on top of everything.”
A: “We already have a CSRF system, why don’t we hook that up as a middleware?”
With a decisive tink of the steel chai glasses on the steel table of Udupi Park in Indiranagar, all of us said some version of “I think we can build it”.
*Anti-climactic running to the office to reach the whiteboard*
Part 2 — Building it
Now, before we jumped the gun and coded our hours away, we decided to write a quick spec. The deeper we went, the more we realized this might be one of the most painless things to build and maintain: a very simple ingestion service, hacked together in a neat async language. Node was the flavor of the month, so Node it was.
This is the first broad level architecture –
Note that what we essentially built from scratch was the Ska service and a small piece of JS dropped on article pages. This piece would simply register each fresh session and renew that session every time the user registered an activity.
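A minimal sketch of what that client piece does; the endpoint path, payload shape, and throttle window are assumptions for illustration, not the actual pixel:

```javascript
// Sketch of the article-page snippet: register a session on load, then
// renew it whenever the reader interacts with the page.
// The /ska/beat endpoint, the data-article-id attribute, and the 15-second
// throttle are illustrative assumptions.
(function () {
  var sessionId = Math.random().toString(36).slice(2);
  var articleId = document.body.getAttribute('data-article-id');

  function beat() {
    fetch('/ska/beat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ sessionId: sessionId, articleId: articleId }),
      keepalive: true
    });
  }

  beat(); // register the fresh session

  // Renew on any activity, throttled so we don't spam the endpoint.
  var lastBeat = Date.now();
  ['scroll', 'click', 'touchstart', 'keydown'].forEach(function (evt) {
    window.addEventListener(evt, function () {
      if (Date.now() - lastBeat > 15000) {
        lastBeat = Date.now();
        beat();
      }
    }, { passive: true });
  });
})();
```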
What does it mean for a user to be “active” on an article?
We broadly defined someone as active if they had interacted with the web page in the last 60 seconds. An interaction could be a scroll, a tap, a toggle, etc.
Once a user didn’t interact with a webpage for 60 seconds, the Redis key corresponding to that session, conveniently set with a TTL (time to live) of 60 seconds, would expire.
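Server-side, that is really the whole trick: one Redis key per active session, refreshed on every heartbeat and left to expire on its own. A sketch assuming Express and node-redis v4, with the routes and key naming invented for illustration:

```javascript
const express = require('express');
const { createClient } = require('redis');

const app = express();
const redis = createClient();

// Each heartbeat refreshes a per-session key with a 60-second TTL.
// When the reader stops interacting, the key simply expires on its own.
app.post('/ska/beat', express.json(), async (req, res) => {
  const { articleId, sessionId } = req.body;
  await redis.set(`ska:${articleId}:${sessionId}`, '1', { EX: 60 });
  res.sendStatus(204);
});

// Live readers on an article = number of unexpired session keys for it.
// (SCAN is fine at modest key counts; a per-article set or counter works too.)
app.get('/ska/live/:articleId', async (req, res) => {
  let count = 0;
  for await (const key of redis.scanIterator({ MATCH: `ska:${req.params.articleId}:*` })) {
    count++;
  }
  res.json({ live: count });
});

redis.connect().then(() => app.listen(3000));
```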
And we got a neat little dashboard like this –
Fun fact, this one time when Google Analytics crashed on us at 200k+ concurrency, we relied on Ska for a while to see how the realtime traffic was doing. When Porush pinged me that he was checking traffic on Ska because GA had crashed, I took a moment to look at our alarms dashboard (all was well), and then replied smugly “merits of a clean, simple system”. 😆
Net cost of hosting the “entire” infra for Ska at the time was $10. Apart from a (sighs) disk-storage-full issue which caused the self-managed single-node Redis to go down once, there was zero downtime for a year. This was a massively successful project used in production for a long time without us having to think about it. We almost forgot about it. Even that one time it went down, there was no domino effect anywhere, as everything was loosely coupled and gracefully short-circuited in case of an outage. Our plans to “productionize” this as per the spec never saw the light of day, that is UNTIL…
Part 3 — Evolving with the brave new world
While building and scaling Ska, we learnt a lot about how ingestion systems should be designed. We burnt our hands plenty with spurious data, traffic spam, and single points of failure.
More importantly, by this time we had a one-of-a-kind revenue share system for our thousands of writers, and the Cloud SQL instance behind it wasn’t scaling for us. Despite our attempts at indexing, hand-written materialized views, and sequential querying, the sheer volume of data we had by then (over 800 million rows), and the sheer variety with which it needed to be queried, meant it just didn’t hold up.
When latency spikes became menacing, we decided to move the primary datastore to Elasticsearch. This involved us having to build an additional ingestion endpoint for articles. But we already had an ingestion endpoint.
And so, Ska evolved.
Almost like before, but each node you see now is actually a multi-node, highly available, fault-tolerant system
Part 4 — Current state of things
We have grown by leaps and bounds since then, both in terms of traffic and scale, and our systems have had to keep up. To simplify and reduce the calls we dispatch from the client side, and to unify the “analytics pixel” we use across desktop, mobile and AMP (Accelerated Mobile Pages), we extended the pageview ingestion queue to also be indexed and persisted on a large Elasticsearch cluster. This cluster has to be large enough to keep up with data that grows by several million rows each day.
The pageview data is now persisted for audit, analytics, and revenue share
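Conceptually, the persistence side of that queue is a consumer that drains a batch of pageview events and bulk-indexes them. A rough sketch, assuming the v8 @elastic/elasticsearch client, with the daily index layout and event shape invented for illustration:

```javascript
const { Client } = require('@elastic/elasticsearch');
const es = new Client({ node: 'http://localhost:9200' });

// Sketch of the queue consumer: take a batch of pageview events and
// bulk-index them into a (hypothetical) daily pageviews index.
async function flushPageviews(events) {
  const operations = events.flatMap((evt) => [
    { index: { _index: `pageviews-${evt.viewedAt.slice(0, 10)}` } }, // e.g. pageviews-2021-06-01
    {
      articleId: evt.articleId,
      country: evt.country,
      platform: evt.platform, // desktop / mobile / amp
      viewedAt: evt.viewedAt,
    },
  ]);

  const result = await es.bulk({ refresh: false, operations });
  if (result.errors) {
    // In practice you'd inspect result.items and retry or dead-letter the failures.
    console.error('some pageviews failed to index');
  }
}
```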
On top of that, to ensure that our revenue share writers get to see their revenue in milliseconds instead of several seconds, we built a smaller “crux” document on Elasticsearch.
I’m sure by this point you’ve noticed our “give-things-an-obvious-name” syndrome. We sure could have called crux something from the Marvel universe, or from Greek mythology, but we just called it “crux”.
Crux, as the name indicates, is an informed summary of a large group of records. It takes the complete set of page views for a particular period and groups them into page views per article and per country (the country where the page was viewed). For over 4 million page view records, crux-ification shrinks the dataset down to just above 36k records. The most actively queried dimensions are chosen for this crux.
Reducing records while not losing any data — that’s free compute power
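For the curious, a crux-style roll-up maps naturally onto an Elasticsearch composite aggregation. A hedged sketch, reusing the client from the previous snippet, with the index, field names, and page size all assumed for illustration:

```javascript
// Illustrative crux roll-up: group a day's raw pageviews by article and
// country, producing one summary record per (article, country) pair.
// Index and field names are assumptions, not our exact mapping.
async function cruxify(es, day) {
  const result = await es.search({
    index: `pageviews-${day}`,
    size: 0,
    aggs: {
      by_article_country: {
        composite: {
          size: 10000, // for larger days, paginate with the returned after_key
          sources: [
            { articleId: { terms: { field: 'articleId' } } },
            { country: { terms: { field: 'country' } } },
          ],
        },
      },
    },
  });

  // Each bucket becomes one crux record: (articleId, country, pageViews).
  return result.aggregations.by_article_country.buckets.map((b) => ({
    articleId: b.key.articleId,
    country: b.key.country,
    pageViews: b.doc_count,
  }));
}
```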
This groundbreaking crux is almost magical when coupled with distributed shards of Elasticsearch. We expected positive results, but it was off the charts, really. Bear in mind, this is not everyday consumer data but a complex report, one our writers were used to waiting on for about half a minute!
When you have some well-earned money, would you rather see it in milliseconds or in several seconds?
When our amazing diverse group of content creators are happy, we unabashedly call it a “direct business impact”.
Part 5 — Plans to Open Source
Sportskeeda was built on Open Source. We were a WordPress blog born out of our CEO’s passion for taking sports creation and consumption to the next level. We have come a long way from that in terms of engineering, but we still use a ton of Open Source software across our stack. We also give back to the Open Source community, and intend to give back even more in the future.
Our developers have contributed (actual code contribution, not just fixing typos in documentation, although that is equally important work) to projects like Firefox, Insomnia, React Native Admob, Google Analytics PHP interface, etc.
And we see the huge value that the community could derive from the tools we have built in-house with great passion. We are planning to open source a version of Ska, along with some bash utilities that help us tune and scale the infrastructure.
Tech and sports are by far the two most exciting disciplines we humans have invented. We combine the two — imagine our excitement!