A comparison of Azure Data Science platforms

yuuka
7 min read · Jan 18, 2020


Someone on a chatroom I’m in was asking about using Azure Databricks, so I thought I’d take a look at what it was all about for a project I was thinking of doing.

Separately, while I mentioned previously that Azure Notebooks was only good for prototyping, I have since discovered that you can connect an Azure Data Science VM to a notebook, which probably merits giving the Azure Notebooks platform a second look.

Butter up

As you know, I have been quite interested in the volumes of passenger data generated by the Land Transport Authority of Singapore with regard to the usage of public buses and trains.

A question came up through the analysis of that data: can I use it to find the most congested section of the MRT in terms of passengers? There are two parts to this:

  • First, develop a path generator that finds the shortest possible path for passengers to take between every pair of stations
  • Second, “walk” the paths generated and apply the actual origin/destination data to every segment along the path

The road ahead

We start with the path generator, since it’s needed for the second step (obviously), though it’s the more boring part. Of course, I could avoid writing my own path generator and just use Google Maps API calls or something, but I’m doing something on the order of 180² path calculations, and I don’t want to get banned from Google Maps for excessive API usage. Besides, while a self-made path generator would be computationally intensive, I should only need to run it once or twice a year when new stations open.

The idea is to build a representation of the MRT map in code for a Dijkstra’s pathfinding algorithm to walk, while representing actual travel times in order to influence the route chosen.

Example of what we will be plotting

The actual MRT system is a lot more complex, so let’s explain the concept with an example. Take this sample “system” with two lines, ABC and DBEF. For a passenger travelling from A to E, we need to determine the route taken: in this case A-B-E.
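As a minimal sketch of the idea, here is plain Dijkstra over the sample two-line system above. The edge weights are made-up travel times in minutes, not real data; the real map would carry actual inter-station timings so they influence the route chosen:

```python
import heapq

# Hypothetical travel times (minutes) for the sample system:
# line ABC (A-B-C) and line DBEF (D-B-E-F), interchanging at B.
graph = {
    "A": {"B": 3},
    "B": {"A": 3, "C": 4, "D": 2, "E": 3},
    "C": {"B": 4},
    "D": {"B": 2},
    "E": {"B": 3, "F": 5},
    "F": {"E": 5},
}

def shortest_path(graph, start, end):
    """Dijkstra: return the fastest station sequence from start to end."""
    queue = [(0, start, [start])]  # (cost so far, current station, path taken)
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == end:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, minutes in graph[node].items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + minutes, nxt, path + [nxt]))
    return None

print(shortest_path(graph, "A", "E"))  # ['A', 'B', 'E']
```

Running this for every pair of stations gives the full path table; with ~180 stations that is the 180² calculations mentioned earlier, but it only needs to run when the network changes.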

We then combine that with the passengers travelling A-B and B-E to get the real loading on those two segments. But for B-E it’s not so easy: since passengers travelling B-F also have to pass through E, we’ll have to throw them in somehow.

Confusing? This is where the pathfinding algorithm comes in. On my computer, I’ve generated the route data for getting between every pair of MRT stations. The paths follow the fastest route between a pair of stations, similar to what Google or the MyTransport app would tell you. One wrinkle: since using unpaid interchanges counts as two trips, the pathfinder will go the long way to keep you within the system. You can ignore those, or decide that people are weird.
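To make the “walk the paths” step concrete, here’s a rough sketch using the sample network. The paths and origin/destination counts below are made up for illustration; the real version uses the generated path table and LTA’s O/D data:

```python
from collections import defaultdict

# Hypothetical precomputed fastest paths (from the path generator) and
# origin/destination passenger counts for one hour, on the sample system.
paths = {
    ("A", "E"): ["A", "B", "E"],
    ("B", "E"): ["B", "E"],
    ("B", "F"): ["B", "E", "F"],
}
od_counts = {("A", "E"): 120, ("B", "E"): 80, ("B", "F"): 50}

def segment_loads(paths, od_counts):
    """Walk each O/D pair's path, adding its passengers to every segment on it."""
    loads = defaultdict(int)
    for od, passengers in od_counts.items():
        path = paths[od]
        for a, b in zip(path, path[1:]):  # consecutive station pairs = segments
            loads[(a, b)] += passengers
    return dict(loads)

print(segment_loads(paths, od_counts))
# B-E carries the A-E, B-E and B-F passengers: 120 + 80 + 50 = 250
```

This is why B-E ends up busier than any single O/D pair suggests: every path that threads through the segment contributes to its load.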

Battle of the data platforms

With the question of routes settled, now we talk about walking the paths and processing the data.

I first ran it on my own computer, but it took hours, given that I run on a toaster approaching its fifth birthday. With that in mind, let’s look at what Azure offers us:

The most user-friendly one, at first glance, is Azure Databricks. The setup wizard even deploys all the necessary compute resources for you, so all you do inside Databricks is write your code and run it. One point for Databricks.

Here, though, the devil is in the details. Databricks uses Apache Spark as its main compute platform. While it supports Python, which I’m more used to, the use of Apache Spark means there’s a learning curve to things. I’ve also found that Spark appears a lot slower for my workloads compared to running locally or in Azure Notebooks.

Azure Notebooks, as mentioned previously, provides a more visual platform for prototyping code than the Python shell, using Jupyter notebooks, a familiar environment for those trained on Anaconda (like me). The free compute tier can be quite slow, but it’s useful for small prototyping work. The big boys can arguably use something better, like connecting an Azure Data Science VM to Notebooks. One point to Notebooks for familiarity, another for extensibility.

However, I’ve found that when doing personal work, this can be quite troublesome. All the DSVM provides is a JupyterHub server, and you have to pull the credentials and other connection information from the Azure portal for Azure Notebooks to connect to it. It’s not as easy as Databricks, where there’s a big button in the Portal you can just click to get into your data workspace, which then automatically spins up all the compute resources it needs.

Of course, Notebooks does have convenience features for connecting to a DSVM, but you have to be using Azure Active Directory (another can of worms) to get them. That’s obviously out of my league, and there’s no point setting it up just for this. So I’ll just have to deal with it.

In short: Notebooks 2, Databricks 1.

What do I see?

Now let’s get back to this project. If you came here from FTRL, this is likely what you’re interested in.

After letting the notebooks hum away in a datacenter somewhere, what we get back is a bunch of Excel files. These tell us how many people are predicted to use a train line between each pair of stations. However, since we have no surefire way of knowing who actually took which route, the data should ideally be validated against the counts collected by the folks hired to count people on platforms. Without that data, we’ll take what we have with a pinch of salt; we’re doing all this for science anyway.

Like the other datasets I’ve created, I’ve sectioned the data by hour. From this, we can see which MRT lines are really in need of help, because they’re reaching the point of overcrowding. As a reminder, an “acceptable” loading is 1,600 passengers for a 6-car train, which scales down to 800 for a 3-car train.

So, here’s a quick summary, as of December 2019:

  • The NEL is really in big trouble. Farrer Park to Little India sees around 33,670 passengers between 8am and 9am, which is just over 21 full trains. Given that the design of the NEL trains means there can be wasted space when folks don’t move in, the NEL could be a lot more crowded than we think.
  • The NSL out of Jurong East is really not that bad. The most crowded section of the EWL is actually Clementi to Dover, standing at 23,495 passengers between 7am and 8am, and 21,779 from 8am to 9am. On the east side, things top out at around 19k between Lavender and Bugis, and the Circle Line does relatively little to ease the crowd at Paya Lebar.
  • Speaking of the CCL, the stretch between Lorong Chuan and Serangoon really cries for help, with 15,463 passengers between 7am and 8am and 17,425 between 8am and 9am. Since we’re dealing with half-sized trains here, that’s proportionally similar to, if not slightly worse than, the NEL’s problem.
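The comparisons above boil down to simple arithmetic: divide an hourly count by the “acceptable” capacity for that line’s train length to get full-train equivalents. A quick sketch, using the figures quoted:

```python
# "Acceptable" loading scales with train length: 1600 passengers for a
# 6-car train, proportionally less for shorter sets (800 for 3 cars).
def full_train_equivalents(passengers_per_hour, cars=6, capacity_six_car=1600):
    capacity = capacity_six_car * cars / 6
    return passengers_per_hour / capacity

# NEL (6-car trains), Farrer Park -> Little India, 8am-9am
print(round(full_train_equivalents(33670), 1))          # 21.0
# CCL (3-car trains), Lorong Chuan -> Serangoon, 8am-9am
print(round(full_train_equivalents(17425, cars=3), 1))  # 21.8
```

So despite the smaller absolute numbers, the CCL segment needs proportionally more full trains per hour than the NEL’s worst stretch.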

Maybe we can talk more about this on FTRL sometime, since there are a variety of operational issues we could look at. For now, though, remember that the numbers I quote aren’t distributed evenly across the hour: travel patterns could be heavily weighted in a certain direction. But this is what LTA gives us, so we have to deal with it.

As usual, source code and generated files are available here. Don’t worry, all my prototyping was done in the Azure Notebooks copy of the repo.

What do I think?

I didn’t really write this as a full guide on how to set up a Notebook with a VM or how to use Databricks, but as a personal reflection on which of the dizzying array of offerings on Azure I’ve found suits my projects better.

Personally, as a hobbyist, I’d recommend Notebooks. Databricks costs a pretty penny, and the language barrier means your code may not make it through fully intact. While the free tier of Notebooks can be pretty gimped, the ability to attach a Data Science VM, should you be able to pay for one, makes Notebooks far more extensible for future production workloads.

Of course, with a DSVM, you should remember to turn it off when you’re not using it.

Full disclosure: As a Microsoft Student Partner, I receive $150 of monthly credit for Microsoft Azure. That credit pays for the Azure resources I’m using to work on this project.

But you can do (almost) everything I did with $100 of free Azure credit (and more), if you’re a student with a valid school email address. More details here.
