
One of our main missions is to improve the quality of trips for every passenger and driver who spends too much time in traffic each day. To do that, we need to know everything about a trip: its duration, its detour distance, and the expected and unexpected problems in traffic. For that reason, we have started to collect and analyze all trip-related data.

In this series of posts, I will introduce our data pipeline in detail: which technologies we use, how we use them, and why we chose them. First things first: in this post, I will briefly show the general architecture of our infrastructure.

The amount of data we process grows almost every day, as our operations expand and our technology improves. To stay secure, reliable, fault-tolerant, scalable, sustainable, and efficient, we have chosen a Big Data architecture that can handle terabytes to petabytes of data per day.

Like driving in traffic, we need to make decisions and predictions in real time. Real-time processing algorithms keep our trips running smoother each day than the day before. Various daily tasks also monitor trip routines, report anomalies, and predict how trips change over time.

Let’s talk about technical stuff…

At Volt Lines, we needed to pick one of the proven, stable architectures of the Big Data world and adapt it to our own conditions. Based on our knowledge, experience, and needs, and of course after tests and experiments, we decided to build our Big Data environment as a variation of the SMACK stack. As a growing startup, we need our architecture to be as flexible as possible to handle the new challenges we face every day.

Our stack includes many components: Amazon Kinesis, Apache Cassandra, Apache Spark, Apache Mesos, Marathon, Chronos, Amazon S3, Amazon DynamoDB, Fluentd, and Luigi. In this post, I am going to introduce only the main components: Amazon Kinesis, Apache Cassandra, Apache Mesos, Apache Spark, and Luigi.

Ensure you have the data: Amazon Kinesis!
To keep it simple, forget about APIs and event types and suppose there is an engine that generates events in JSON format. To process the events sequentially, we send them to a message queue, in our case Kinesis, one of the services AWS offers. Once the data is in Kinesis, we have one day (with our retention configuration) to read and process the events. Kinesis provides high throughput and low latency, which keeps our data safe and fast to access, and it lets us read the same data multiple times in real time.
Volt Lines Data Pipeline Architecture
General View of Kinesis Streams and Shards
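As a rough sketch of the producer side, assuming a hypothetical trip-event schema and a stream named `trip-events` (both illustrative, not our actual names), a record for Kinesis is built roughly like this:

```python
import json

def build_record(event: dict) -> dict:
    """Serialize a trip event into a Kinesis put_record payload.
    The partition key groups all events of one trip onto the same shard,
    so they can be read back in order."""
    return {
        "StreamName": "trip-events",  # hypothetical stream name
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(event["trip_id"]),
    }

record = build_record({"trip_id": 42, "type": "trip_started", "ts": 1554710400})

# With boto3 (not imported here), the actual call would be:
#   boto3.client("kinesis").put_record(**record)
print(record["PartitionKey"])  # -> "42"
```

Keying the partition on the trip ID is one common choice; it spreads load across shards while preserving per-trip ordering.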
Now it is Apache Spark time

There are several technologies we could use to load the data from Amazon Kinesis and process it, such as AWS Lambda functions, Kinesis Data Firehose, Apache Flink, and Apache Storm. But we chose Apache Spark. The main reasons: it supports multiple languages, it has a wide community, and it is well suited for real-time processing, batch processing, and machine learning.

We can say Apache Spark is the main component of our data pipeline; we use it for multiple purposes:
1. Load the events from Amazon Kinesis and save them into a database for later use,
2. CEP (Complex Event Processing),
3. Create periodic reports,
4. Create routes, and better routes,
5. Change routes in real time to handle unexpected situations.
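Purpose 1 boils down to a transform that Spark applies to every micro-batch of raw records. Ignoring the Spark and Cassandra APIs, the core logic is a plain function over a batch of JSON strings (the field names here are invented for illustration):

```python
import json

def parse_batch(raw_records):
    """Decode a micro-batch of raw Kinesis records into rows ready to be
    written to an events table. Malformed records are dropped, not fatal."""
    rows = []
    for raw in raw_records:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip garbage rather than failing the whole batch
        rows.append((event["trip_id"], event["type"], event["ts"]))
    return rows

batch = [
    '{"trip_id": 1, "type": "trip_started", "ts": 100}',
    'not json at all',
    '{"trip_id": 1, "type": "trip_finished", "ts": 160}',
]
print(parse_batch(batch))
```

In Spark, a function like this would be mapped over each batch coming from the Kinesis source, with the resulting rows written out to the database.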

That is enough about Apache Spark for now; I need to save some material for the next posts 🙂

So, where do we save our data?

In Apache Cassandra (C*), of course. C* is a NoSQL database built for write-heavy workloads, and we had lots of reasons to choose it:
1. Easy to run: creating a Cassandra cluster requires no external application or technology,
2. Masterless architecture,
3. Supported by GeoMesa, a tool for large-scale geospatial analytics,
4. Works very well with Apache Spark,
5. Extra capacity can be added effortlessly whenever we need or want it.
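To illustrate why C* fits write-heavy trip data, a table for the event stream might be keyed like this (a hypothetical schema, not our actual one):

```sql
-- Partitioning by trip_id keeps all events of one trip on one node;
-- clustering by ts returns them in time order.
CREATE TABLE trip_events (
    trip_id    bigint,
    ts         timestamp,
    event_type text,
    payload    text,
    PRIMARY KEY ((trip_id), ts)
);
```

With this layout, every event append is a cheap sequential write, and reading one trip's history is a single-partition query.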

Distributed Resource Management, Apache Mesos

We run Apache Spark and Cassandra on Mesos because it is easy to maintain and it is the resource manager proposed in the SMACK stack (the acronym for Spark, Mesos, Akka, Cassandra, and Kafka). Mesos abstracts CPU, memory, storage, and other computational resources away from physical or virtual machines, making distributed systems fault-tolerant and elastic. It runs applications within its cluster and provides a highly available platform.
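For long-running services on Mesos, Marathon (mentioned earlier in our stack) describes each application as a JSON spec. A minimal, illustrative example (the id, command, and sizes are made up):

```json
{
  "id": "/data/cassandra-node",
  "cmd": "cassandra -f",
  "cpus": 2,
  "mem": 8192,
  "instances": 3
}
```

Marathon then keeps the requested number of instances running, restarting them elsewhere in the cluster if a machine fails.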

We also have a workflow manager: Luigi

Luigi is Spotify's open-source Python library for building complex workflows. We have lots of reports and calculations, some of which depend on others. To keep those dependencies clear and manageable, we run a Luigi daemon.

We also use some other tools for analyzing and modeling our data, such as Elasticsearch, Jupyter Notebook, GeoMesa, several AWS services, and machine learning libraries from Python.
. . .
In this post, I tried to introduce the tools we chose to create our data pipeline. As it is the first post of the series, I haven't mentioned any numbers about how well our system performs :). I am going to write about the performance metrics of our architecture in the next post.
