One of our main missions is to improve the quality of trips for every passenger and driver who spends too much time in traffic each day. To do that, we need to know everything about a trip: its duration, its detour value, and the expected and unexpected problems in traffic. For that reason, we have started to collect and analyze all trip-related data.
In this series of posts, I will introduce our Data Pipeline in detail: which technologies we use, how we use them, and why we chose them. First things first: in this post, I will briefly show the general architecture of our infrastructure.
Almost every day, the amount of data we process increases, driven by the growth of our operations and our technical improvements. To be secure, reliable, fault-tolerant, scalable, sustainable, and efficient, we have chosen a Big Data architecture that can handle data volumes of terabytes to petabytes per day.
Like driving in traffic, we need to make decisions and predictions in real time. We have real-time processing algorithms running to make our trips smoother each day than the day before. Various tasks also run every day to monitor daily trip routines, report anomalies, and predict how trips change over time.
Let’s talk about technical stuff…
There are several technologies we could use to load the data from Amazon Kinesis and process it, such as AWS Lambda functions, Kinesis Firehose, Apache Flink, and Apache Storm, but we chose Apache Spark. The main reasons for choosing Spark are that it supports multiple languages, it has a wide community, and it is well suited for real-time processing, batch processing, and machine learning.
We can say Apache Spark is the main component of our data pipeline; we use it for multiple purposes:
1. Loading events from Amazon Kinesis and saving them into a database for later use (see the sketch after this list),
2. Complex Event Processing (CEP),
3. Creating periodic reports,
4. Creating routes and finding better ones,
5. Making changes to routes in real time to handle unexpected situations.
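To give a flavor of the first item, here is a minimal PySpark sketch of reading trip events from Kinesis and persisting them to Cassandra. The stream, keyspace, table, and field names are all hypothetical placeholders, and our real jobs are of course more involved:

```python
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="trip-event-loader")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Attach to the Kinesis stream; stream name, endpoint, and region are placeholders.
events = KinesisUtils.createStream(
    ssc, "trip-event-loader", "trip-events",
    "https://kinesis.eu-west-1.amazonaws.com", "eu-west-1",
    InitialPositionInStream.LATEST, checkpointInterval=10)

def save_partition(records):
    # Open one Cassandra session per partition, not per record.
    from cassandra.cluster import Cluster
    session = Cluster(["cassandra-node"]).connect("trips")  # hypothetical keyspace
    insert = session.prepare(
        "INSERT INTO trip_events (trip_id, ts, payload) VALUES (?, ?, ?)")
    for record in records:
        event = json.loads(record)  # assumes JSON events with these fields
        session.execute(insert, (event["trip_id"], event["ts"], record))
    session.cluster.shutdown()

events.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))

ssc.start()
ssc.awaitTermination()
```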
Actually, that is enough about Apache Spark for now; I need to save more for the next posts 🙂
Of course, Apache Cassandra (C*). C* is a NoSQL database built for write-heavy workloads, and we had plenty of reasons to choose it:
1. Easy to run: creating a Cassandra cluster requires no external application or technology,
2. Masterless architecture,
3. Supported by GeoMesa, a tool for large-scale geospatial analytics,
4. Works very well with Apache Spark (see the sketch after this list),
5. Extra capacity can be added effortlessly whenever we need or want it.
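As a small illustration of the fourth point, this is roughly what reading a Cassandra table back into Spark looks like through the DataStax spark-cassandra-connector (which must be on the classpath); the keyspace, table, and column names here are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("trip-report")
         .config("spark.cassandra.connection.host", "cassandra-node")
         .getOrCreate())

# Cassandra tables appear as ordinary Spark DataFrames.
trips = (spark.read
         .format("org.apache.spark.sql.cassandra")
         .options(keyspace="trips", table="trip_events")
         .load())

# e.g. the kind of aggregate that feeds our periodic reports
trips.groupBy("day").avg("duration_seconds").show()
```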
We run Apache Spark and Cassandra on Mesos, because it is easy to maintain and it is the cluster manager proposed in the SMACK stack, the acronym for Spark, Mesos, Akka, Cassandra, and Kafka. Mesos abstracts CPU, memory, storage, and other computational resources away from physical or virtual machines, making distributed systems fault-tolerant and elastic. It runs applications within its cluster and also provides a highly available platform.
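For illustration, submitting a Spark job to a Mesos master is a one-liner (the hostname and script name below are placeholders):

```
spark-submit \
  --master mesos://mesos-master.example.com:5050 \
  --conf spark.cassandra.connection.host=cassandra-node \
  trip_event_loader.py
```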
Finally, Luigi: Spotify's open-source Python library that makes it possible to build complex workflows. We have lots of reports and calculations, some of which depend on others. To keep them clear and manageable, we run a Luigi daemon.
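To show the idea, here is a minimal sketch of two dependent Luigi tasks; the task names, parameters, and output paths are hypothetical, and the real run() bodies would pull their numbers from Spark and Cassandra:

```python
import datetime
import luigi

class DailyTripReport(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("reports/daily_{}.csv".format(self.date))

    def run(self):
        # Placeholder body; the real task aggregates trip data.
        with self.output().open("w") as out:
            out.write("trip_id,duration_seconds\n")

class WeeklyTripSummary(luigi.Task):
    week_start = luigi.DateParameter()

    def requires(self):
        # Luigi builds the seven daily reports first, in dependency order.
        return [DailyTripReport(date=self.week_start + datetime.timedelta(days=i))
                for i in range(7)]

    def output(self):
        return luigi.LocalTarget("reports/weekly_{}.csv".format(self.week_start))

    def run(self):
        # Concatenate the daily outputs into one weekly file.
        with self.output().open("w") as out:
            for daily in self.input():
                with daily.open("r") as f:
                    out.write(f.read())
```

Asking for WeeklyTripSummary (for example with `luigi --module trip_reports WeeklyTripSummary --week-start 2017-01-02`) resolves and runs the whole dependency graph; Luigi skips any task whose output already exists, and the central luigid daemon coordinates the workers and visualizes the graph.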