How do you keep data flowing and scale?
The final important domain of big data
that we'll take a look at here is
0:00
infrastructure.
0:04
Infrastructure in the world of big
data allows the data to keep flowing.
0:05
And also allows the systems that we
have discussed to run at scale and
0:10
on large data sets.
0:13
The fundamental unit in
big data infrastructure is
0:15
often a cluster of machines.
0:18
Now typically, this is a group
of networked Linux servers.
0:20
Managing clusters of machines
is a non-trivial task.
0:24
You can't just write your own
homegrown software to manage
0:27
all the servers you have available.
0:30
You need to have the ability to run your
processing code across the cluster and
0:32
then gather the results to
display them to the client.
0:37
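To make the "run across the cluster, then gather the results" idea concrete, here is a minimal sketch in Python. It is only an illustration: threads on one machine stand in for servers in a cluster, and the names `count_words` and `scatter_gather` are hypothetical, not part of any cluster manager.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Work done on one 'node': count the words in its share of the data."""
    return sum(len(line.split()) for line in chunk)


def scatter_gather(lines, workers=3):
    """Split lines into one chunk per worker, process in parallel, combine."""
    # Scatter: deal the lines out round-robin, one chunk per worker.
    chunks = [lines[i::workers] for i in range(workers)]
    # Each thread stands in for a machine in the cluster (an assumption
    # made so the sketch runs on a single computer).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # Gather: merge the partial results into one answer for the client.
    return sum(partial_counts)


lines = ["big data keeps flowing", "clusters run at scale", "gather the results"]
total = scatter_gather(lines)
print(total)
```

A real cluster manager also has to handle scheduling, machine failures, and resource limits, which is why homegrown versions of this tend not to hold up.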
The great news is that there
are awesome cluster management tools.
0:41
A popular cluster manager is Apache Mesos.
0:45
Mesos is used by companies like Airbnb,
Apple, Cisco Systems, Netflix, and Uber.
0:48
Another cluster manager that you're
likely to hear about is Kubernetes.
0:54
Kubernetes is more often used for managing
containers and not virtual machines.
0:58
More in the teacher's notes.
1:03
Another area of infrastructure
that is vastly important
1:05
is the layer of messaging
done between services.
1:08
Now, this includes sending data
between the various storage layers,
1:11
computation engines and
other infrastructure pieces.
1:15
We need systems that can handle
the robust transportation of messages.
1:18
Because our normal tools,
like simple Unix pipes or TCP connections,
1:23
just cannot do the trick for
very large amounts of streaming data.
1:27
One of the most widely used tools
to handle this messaging dilemma is
1:31
Apache Kafka.
1:35
Kafka ensures that you are always able
to keep your data around to be ingested
1:37
by general computation engines or
storage layers.
1:42
It also allows for
1:45
historical playback of data that has
already been streamed through the system.
1:46
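The idea of keeping data around and replaying it can be sketched as an append-only log read by position. This is a toy, in-memory illustration and not Kafka's API: the `MessageLog` class and its methods are hypothetical names chosen for this example.

```python
class MessageLog:
    """Toy append-only log; a stand-in for the concept, not Kafka itself."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message and return its offset (its position in the log)."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset):
        """Return every message at or after offset; reading never deletes."""
        return list(self._messages[offset:])


log = MessageLog()
for event in ["click", "view", "purchase"]:
    log.append(event)

# A consumer that has already seen offsets 0 and 1 picks up at offset 2.
latest = log.read_from(2)
# A new consumer replays the full history from offset 0.
replay = log.read_from(0)
print(latest, replay)
```

Because reading does not remove messages, a slow storage layer and a fast computation engine can each consume the same data at their own pace.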
Kafka is typically placed
between clients and
1:50
the back end servers that run general
computation engines and databases.
1:52
There are many other infrastructure
services that exist, but
1:56
they are out of the scope of this course.
2:00
You may also wanna look into visualizing
your data with tools like D3.js.
2:02
Or maybe you want to manage your state and
configuration across many machines using
2:07
tools like Apache ZooKeeper or
HashiCorp's Consul.
2:11
Data serialization for
faster transfer of data
2:15
can be performed using tools
like Apache Thrift and Parquet.
2:18
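As a rough illustration of why a compact binary encoding can transfer faster than text, the sketch below encodes the same record as JSON and as fixed-layout bytes using only Python's standard library. This is not how Thrift or Parquet encode data; the record fields and the `RECORD` layout are assumptions made up for this example.

```python
import json
import struct

# Hypothetical sensor reading used only for this comparison.
reading = {"sensor_id": 42, "timestamp": 1700000000, "value": 21.5}

# Text encoding: the field names travel with every record.
as_json = json.dumps(reading).encode("utf-8")

# Binary encoding: sender and receiver agree on the layout ahead of time,
# so only the values are sent.
# "<IQd" means little-endian uint32, uint64, float64.
RECORD = struct.Struct("<IQd")
as_binary = RECORD.pack(
    reading["sensor_id"], reading["timestamp"], reading["value"]
)

# The receiver decodes the bytes using the same agreed layout.
sensor_id, timestamp, value = RECORD.unpack(as_binary)
print(len(as_json), len(as_binary))
```

The trade-off is that both sides must share the layout in advance, which is the role a schema plays in serialization tools.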
Now that we've touched on the three
major domains of big data: storage,
2:22
computation and infrastructure,
2:26
we're ready for the final stage of this
course where we'll look at specific
2:27
problems that a well-known company,
Netflix, has encountered,
2:31
and how they are solving
them using big data.
2:35