AI can only take you so far. 🌟 Start with core skills in JavaScript, HTML, CSS, or Python. 🚀

Join the Treehouse affiliate program and earn 25% recurring commission!

New No-Code Track! 🚀start learning today!

🌟 Dreaming of a bright future? 🎓 Ask about the Treehouse Scholarship program! 🚀

✨ Earn college credits in Cybersecurity, JS, HTML, CSS and Python

Well done!

You have completed Introduction to Big Data!

Sign up for Treehouse Back to Library

Preview

Sign up for Treehouse Continue

How Does Netflix Apply Big Data Tools to Solve these Problems?

3:59 with Craig Dennis and Jared Smith

Let's take a look at how Netflix solves their issues.

Teacher's Notes
Questions?
Video Transcript
Downloads
Workspaces

Terms

New Terms:

Docker -- Docker is an open-source project that automates the deployment of applications inside software containers, and runs on top of Linux, Windows, Mac, and other operating systems.
Anomaly Detection -- Anomaly detection is the process of looking for abnormalities in data to discover potentially interesting insights, ranging from security incidents to service failures.
Node.js -- Node.js is a runtime for the JavaScript language that allows it to be run on the server independently of a browser, and is widely used across many companies and projects as way to use JavaScript as a general purpose and widely deployable language.

Learn More

Related Discussions

Have questions about this video? Start a discussion with the community and Treehouse staff.

Sign up

Related Discussions

Have questions about this video? Start a discussion with the community and Treehouse staff.

Sign up

Netflix applies many of the tools and 0:00

frameworks to the domains of big data that we have discussed, 0:02

to solve the problems that they have at operating on such a massive scale. 0:05

From utilizing processing frameworks, to storage systems, to infrastructure tools, 0:09

Netflix exemplifies many of the actual use cases of big data tools. 0:14

So let's dive into three particular examples of how Netflix applies tools that 0:18

we've discussed, to solve their problems. 0:22

So let's take a look at the Data Pipeline. 0:25

Netflix ingest data using Apache Kafka. 0:28

It has two sets of Kafka clusters. 0:31

The first cluster is at the front end, and 0:34

gets its messages from clients that produce messages. 0:36

Which is basically every instance of the Netflix application. 0:39

It sends those messages to the back end services, using the second Kafka cluster. 0:43

Which consumes messages, and sends them to services, like Apache, 0:48

Spark, and Hadoop, for processing and other Netflix custom application. 0:51

The front end Kafka instance also passes data to storage layers, 0:56

such as Amazon S3, and to the search tool, Elasticsearch. 1:01

This is used for both storing data in short and long term storage on Amazon web 1:05

services, as well as for searching across all events using Elasticsearch. 1:10

To ensure resiliency of the Kafka clusters, 1:15

there is a replication factor of 2 for all the messages in the system. 1:18

Which means that they will always have a backup message for 1:22

every message in the system. 1:25

Messages are retained in the Kafka system for 1:27

up to 24 hours, even after being sent downstream or to storage layers. 1:29

In total, Netflix operates 36 Kafka clusters, which consume an average 1:34

of 700 billion messages per day, and up to 8 million events per second. 1:40

Netflix uses machine learning to produce recommendations for 1:46

each of their users based on what videos they have watched and liked. 1:49

So let's look at the problem of serving those recommendations. 1:53

Each of the Netflix clients produce messages, 1:57

which are adjusted by a Kafka cluster, and then sent to the back end for 2:00

processing offline with Apache Hadoop, and Amazon web services. 2:04

Netflix also uses Apache Spark, with the Spark's streaming extension, 2:09

as well as their own version of Apache Storm, 2:13

to produce new recommendations with sub minute latency. 2:16

When their systems need to retrieve prior results from the machine learning 2:20

algorithms, the back end fetches the results from a combination of MySQL, 2:24

Cassandra, and Netflix's own caching engine, which they have also open sourced, 2:28

EVCache, which stands for Ephemeral Volatile Cache. 2:34

In order to run these services at such a massive scale, 2:39

Netflix utilizes various other infrastructure tools and frameworks. 2:42

A major open source infrastructure tool in heavy use as Netflix is Apache Mesos. 2:46

Mesos is a resource management service that can be used to 2:51

run clusters of machines, schedule jobs and services, and much more. 2:54

It runs thousands of containers and 2:59

services at Netflix that help to keep the back end and user facing services online. 3:01

Now, specifically, it runs a mix of Batch, Stream Processing, and 3:06

Service Style Workloads. 3:11

Their use cases for MESOS include Real Time Anomaly Detection, 3:12

Training and Model Building Batch Jobs, 3:17

Machine Learning Orchestration, and Node.js Based Microservices. 3:19

Netflix has built several internal big data tools for resource management. 3:24

Some examples are Mantis, which is a reactive stream processing engine, 3:29

which processes up to 8 million events per second. 3:33

Titus is a docker container job management and execution platform. 3:36

Meson is a general purpose workflow orchestration & scheduling framework 3:40

that powers hundreds of concurrent Machine Learning pipelines. 3:45

Netflix uses a wide range of big data tools across their entire stack, and, 3:49

therefore, are such a great example of how far an organization can 3:53

take the things that we've learned about in this course. 3:56

You need to sign up for Treehouse in order to download course files.

Sign up

You need to sign up for Treehouse in order to set up Workspace

Sign up