Heads up! To view this whole video, sign in with your Courses account or enroll in your free 7-day trial. Sign In Enroll
Well done!
You have completed Introduction to Big Data!
You have completed Introduction to Big Data!
Preview
How Does Netflix Apply Big Data Tools to Solve these Problems?
3:59 with Craig Dennis and Jared SmithLet's take a look at how Netflix solves their issues.
Terms
New Terms:
- Docker -- Docker is an open-source project that automates the deployment of applications inside software containers, and runs on top of Linux, Windows, Mac, and other operating systems.
- Anomaly Detection -- Anomaly detection is the process of looking for abnormalities in data to discover potentially interesting insights, ranging from security incidents to service failures.
- Node.js -- Node.js is a runtime for the JavaScript language that allows it to be run on the server independently of a browser, and is widely used across many companies and projects as way to use JavaScript as a general purpose and widely deployable language.
Learn More
Related Discussions
Have questions about this video? Start a discussion with the community and Treehouse staff.
Sign upRelated Discussions
Have questions about this video? Start a discussion with the community and Treehouse staff.
Sign up
Netflix applies many of the tools and
0:00
frameworks to the domains of big
data that we have discussed,
0:02
to solve the problems that they have
at operating on such a massive scale.
0:05
From utilizing processing frameworks, to
storage systems, to infrastructure tools,
0:09
Netflix exemplifies many of the actual
use cases of big data tools.
0:14
So let's dive into three particular
examples of how Netflix applies tools that
0:18
we've discussed, to solve their problems.
0:22
So let's take a look at the Data Pipeline.
0:25
Netflix ingest data using Apache Kafka.
0:28
It has two sets of Kafka clusters.
0:31
The first cluster is at the front end, and
0:34
gets its messages from clients
that produce messages.
0:36
Which is basically every instance
of the Netflix application.
0:39
It sends those messages to the back end
services, using the second Kafka cluster.
0:43
Which consumes messages, and
sends them to services, like Apache,
0:48
Spark, and Hadoop, for processing and
other Netflix custom application.
0:51
The front end Kafka instance also
passes data to storage layers,
0:56
such as Amazon S3, and
to the search tool, Elasticsearch.
1:01
This is used for both storing data in
short and long term storage on Amazon web
1:05
services, as well as for searching
across all events using Elasticsearch.
1:10
To ensure resiliency
of the Kafka clusters,
1:15
there is a replication factor of 2 for
all the messages in the system.
1:18
Which means that they will
always have a backup message for
1:22
every message in the system.
1:25
Messages are retained
in the Kafka system for
1:27
up to 24 hours, even after being sent
downstream or to storage layers.
1:29
In total, Netflix operates 36 Kafka
clusters, which consume an average
1:34
of 700 billion messages per day, and
up to 8 million events per second.
1:40
Netflix uses machine learning
to produce recommendations for
1:46
each of their users based on what
videos they have watched and liked.
1:49
So let's look at the problem of
serving those recommendations.
1:53
Each of the Netflix
clients produce messages,
1:57
which are adjusted by a Kafka cluster,
and then sent to the back end for
2:00
processing offline with Apache Hadoop,
and Amazon web services.
2:04
Netflix also uses Apache Spark,
with the Spark's streaming extension,
2:09
as well as their own
version of Apache Storm,
2:13
to produce new recommendations
with sub minute latency.
2:16
When their systems need to retrieve
prior results from the machine learning
2:20
algorithms, the back end fetches
the results from a combination of MySQL,
2:24
Cassandra, and Netflix's own caching
engine, which they have also open sourced,
2:28
EVCache, which stands for
Ephemeral Volatile Cache.
2:34
In order to run these services
at such a massive scale,
2:39
Netflix utilizes various other
infrastructure tools and frameworks.
2:42
A major open source infrastructure tool
in heavy use as Netflix is Apache Mesos.
2:46
Mesos is a resource management
service that can be used to
2:51
run clusters of machines,
schedule jobs and services, and much more.
2:54
It runs thousands of containers and
2:59
services at Netflix that help to keep the
back end and user facing services online.
3:01
Now, specifically, it runs a mix of Batch,
Stream Processing, and
3:06
Service Style Workloads.
3:11
Their use cases for
MESOS include Real Time Anomaly Detection,
3:12
Training and Model Building Batch Jobs,
3:17
Machine Learning Orchestration,
and Node.js Based Microservices.
3:19
Netflix has built several internal big
data tools for resource management.
3:24
Some examples are Mantis, which is
a reactive stream processing engine,
3:29
which processes up to 8
million events per second.
3:33
Titus is a docker container job
management and execution platform.
3:36
Meson is a general purpose workflow
orchestration & scheduling framework
3:40
that powers hundreds of concurrent
Machine Learning pipelines.
3:45
Netflix uses a wide range of big data
tools across their entire stack, and,
3:49
therefore, are such a great example
of how far an organization can
3:53
take the things that we've
learned about in this course.
3:56
You need to sign up for Treehouse in order to download course files.
Sign upYou need to sign up for Treehouse in order to set up Workspace
Sign up