A serverless web app with Azure Functions and GitHub Pages
Or at least, that’s the plan. But let’s see how much of a Rube Goldberg machine we can make this, for science.
If you spend plenty of time on Hardwarezone or other Singaporean internet forums, you’ll come away with the impression that the government can be notoriously opaque at times. But with all that buzz around the Smart Nation Initiative and whatnot, you now have things like data.gov.sg and open data programs. Naturally the LTA is no exception.
So while browsing the LTA Datamall, looking for some information for a school project a while back, I came across something rather interesting:
Oh, looks like these data could be useful for my writing project, From the Red Line.
If you want to find out what I ended up doing with all these, you can jump here. But if you want to read a play-by-play of my fumbling with things that probably are beyond my comprehension, read on.
Where to begin?
I quickly pulled open the documentation to find out. In short, LTA has somehow decided to open up the aggregated averages of bus/train network usage per hour. It seems to be the same dataset behind this page, as well as those new posters they’ve put up in stations. Govtech also did something similar with buses, but it appears that it isn’t so straightforward with trains, since unlike buses where you have to tap off and on again, you can (mostly) change train lines without passing a faregate.
After a bit of monkeying around in MS Excel and Python scripts, I think I’ve gotten myself something that remotely resembles a usable dataset, which I’ve made available here.
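The preprocessing boiled down to aggregation. Here's a minimal sketch of the idea using only the standard library; the column names and sample rows are illustrative stand-ins, not LTA's actual schema:

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the rough shape of a per-station, per-hour
# tap volume export. The real LTA files have more columns and values.
RAW = """PT_CODE,TIME_PER_HOUR,TOTAL_TAP_IN_VOLUME,TOTAL_TAP_OUT_VOLUME
NS1,8,1200,300
NS1,9,900,250
EW2,8,700,1500
"""

def tap_in_totals(raw_csv):
    """Collapse raw hourly rows into {station: total tap-ins} --
    the spirit of the aggregation, if not the exact script."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(raw_csv)):
        totals[row["PT_CODE"]] += int(row["TOTAL_TAP_IN_VOLUME"])
    return dict(totals)

print(tap_in_totals(RAW))  # {'NS1': 2100, 'EW2': 700}
```

The real scripts do a lot more cleanup, but this is the basic shape of turning raw tap records into something chartable.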
But well, that’s not enough. I’m already drowning in numbers, and if you open it yourself you’d be confused too. And we haven’t even counted the origin-destination pair files, which are easily a hundred times bigger (opening them in Excel caused my computer to start breathing fire, and the GitHub CSV viewer is clearly unfit for purpose here). So what do we do?
Of course we kill ants with a sledgehammer
Let’s go with a web-based frontend to identify how busy each stop (and route) is. Something like what these guys did, but we’ll need to be able to play around with stuff because of the sheer amount of data we have, and no time yet for fun machine learning things. There are probably far better tools available to the professionals (like that bus usage simulator GovTech built), but this should be enough for a hobbyist blog. At least we can learn something.
The backend… well, this is where things get interesting. I initially experimented with using SQL databases, but that didn’t really work out for performance reasons. Another factor is that I have CSVs on GitHub and I’d rather not move them if I can help it, nor do I have a server to actually put them on, along with PHP code (VMs are expensive, okay!)
So, in the name of science, let’s give Azure Functions a try. The most typical use of functions, and other on-demand snippets of code like them, would be for something like the AWS IoT button, where you press it and shit happens on the internet, maybe a light will blink. But let’s try doing that in a RESTful manner, where we throw some arguments at it, the script goes out to the GitHub repo and digests the CSV files, and it vomits out numbers we can visualize. As a graph? On a map? Well, I guess that depends on what we do.
The whole request-handling system is easy enough when implemented as a function: get arguments, retrieve the data, perform filtering, dump to JSON. But naturally, first we have to create and run the thing.
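Stripped of the Azure plumbing, that pipeline is small enough to sketch in a few lines. This is an illustrative stand-in (made-up columns, data inlined instead of fetched from the repo), not the actual function code:

```python
import csv
import io
import json

# Stand-in for a CSV that the real function would fetch from GitHub.
DATASET = """origin,destination,trips
NS1,EW2,540
NS1,NE3,120
EW2,NS1,610
"""

def handle_request(params, raw_csv=DATASET):
    """Get arguments, retrieve 'database', filter, dump to JSON."""
    rows = csv.DictReader(io.StringIO(raw_csv))
    wanted = [r for r in rows if r["origin"] == params.get("origin")]
    return json.dumps(wanted)

print(handle_request({"origin": "NS1"}))
```

In the real thing this body sits behind an HTTP trigger, with `params` coming from the query string of the request the GitHub Pages client sends.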
To make sure that our filtering instructions work as expected, we can use a Jupyter notebook, which lets you see the result of each instruction immediately after you run it. You can already do that in the Python shell, but Jupyter is much easier.
Since I mainly develop on weaker workstations which don’t really have the firepower to run a full anaconda-esque workload, I used Azure Notebooks instead.
Once confident that everything works as it should, I put it all into a function project (func init) and upload the “server” functions to Azure with func azure functionapp publish. Microsoft’s guides have always been good on the matter.
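For the curious, the workflow looks roughly like this (assuming the Azure Functions Core Tools are installed and you’re signed in via the Azure CLI; the project and app names are placeholders):

```shell
func init trip-data-api --python     # scaffold a new Python function project
cd trip-data-api
func new --name getTrips --template "HTTP trigger"
func start                           # run and poke at it locally first
func azure functionapp publish my-function-app-name
```

Microsoft’s quickstart docs walk through each of these steps in more detail.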
The first draft, with a GitHub Pages-hosted client, turned out to have pretty trashy performance (requests could take around a minute!)
Later improvements managed to cut down the response time to something satisfactory, but what else can we still do? I went and did some research.
It’s important to remember that when using a consumption plan, the first run of a function after a specified timeout will take a while, because the function has to first load itself onto the computing layer. This probably means that for anything beyond experiments like these, you’d probably want to use something with a lot more iron, such as the premium plan which keeps at least one instance of the function app on hot standby, idling and ready to go when you need it.
I also considered migrating my CSVs to Azure Cosmos DB, but as mentioned, cost is a factor and Cosmos DB isn’t cheap. The cost could well be worth it if things ever get more serious, though, since the data would then be hosted in the same datacentre as the functions, and the things Cosmos DB promises sound really good, I have to admit. (Indexing, scalable throughput, I know you’re probably thinking of trying to DDoS my little web app here, but oh well…)
Let’s see things in action
The web link is available here. Please don’t abuse it. (I know it looks bad, HTML design isn’t my strong suit, make all the “Graphic Design Is My Passion” memes you want)
This should show you the average number of trips, between a station pair, in a certain given month. Transfer passengers within interchange stations are NOT counted, so this doesn’t show how passengers got between stations, only that the trip was made (I’ve asked LTA about Newton, Tampines and Bukit Panjang, but did not receive an answer).
Due to some laziness in writing the processing scripts, I’ve also left out the extra suffixes at some interchange stations (such as NE12/CC13 and EW12/DT14 in the charts). The replaced codes can be viewed here. Note that the raw data from LTA indicate some interchange stations as separate complexes, such as at Serangoon (NE12/CC13). Bayfront is also interesting in that while the Circle and Downtown lines share the same building, gates on the exit A/E side are recorded for the Downtown line, while gates on the B/C/D side are recorded for the Circle line.
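The suffix cleanup itself is nothing fancy; a lookup table does the job. The mapping below is a made-up two-entry subset purely for illustration, not the actual replacement list:

```python
# Collapse joint interchange codes from the raw data into one canonical
# code per station. Only a hypothetical sample of the real table.
CANONICAL = {
    "NE12/CC13": "NE12",   # Serangoon
    "EW12/DT14": "EW12",
}

def normalise(code):
    """Return the canonical station code, passing unknown codes through."""
    return CANONICAL.get(code, code)

print(normalise("NE12/CC13"))  # NE12
print(normalise("NS1"))        # NS1
```

The pass-through default means stations that only ever have one code come out untouched.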
I have to note that these datasets are clearly incomplete since they show only entries and exits into the network. As mentioned, transfer passengers aren’t counted in these data, and there’s no way to piece together a journey based on bus-MRT-bus transfers or other fun stuff (but then again LTA puts guys in stations to manually count people waiting for trains…). But they should be good enough for what we want to do on FTRL.
The bus data exists, and I have it, but it’ll take some work with both the preliminary processors and the actual functions before this app will work with it. A good experiment anyway, and probably what I’ll do next.
Full disclosure: As a Microsoft Student Partner, I receive $150 of monthly credit for Microsoft Azure. That credit pays for the Azure resources I’m using to work on this project.
But you can do everything I did with $100 of free Azure credit (and more), if you’re a student with a valid school email address. More details here.