You have completed Introduction to Big Data!
How do you store all this data?
Learn More
- SQL Basics Course
- MongoDB Basics
- PostgreSQL
- MySQL
- HDFS Overview
- Cassandra
- Amazon S3 at Dropbox
- They have since moved away (only as of 2016), but it is still a great use case for Amazon S3.
- Free Graph Databases e-book from O'Reilly
- Neo4j
- Market Survey of Graph Databases/Analytics Platforms
- GraphDatabases Neo4j 2.0 examples
- Discover Graph Databases with Neo4j and PHP
In order to work with data, you often
need to store it in some medium before or
0:00
after processing it.
0:04
For this reason, data storage tools and
frameworks make up a large part of
0:05
the major tools and
frameworks in the big data ecosystem.
0:09
We'll be taking a look at these three
major classes of data storage systems.
0:13
Relational databases are used to store
structured data like we just talked about.
0:17
These databases store the structured data
according to what is known as a schema.
0:21
A schema defines the way the data
is organized in the system,
0:26
also known as its structure.
0:30
Relational databases use a query
language that can access these schemas,
0:32
most typically a dialect of SQL, which
stands for structured query language.
0:37
SQL provides a standard way to query,
manipulate, and
0:42
store data in relational databases.
0:45
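As a rough illustration of a schema and of SQL's query, manipulate, and store operations, here is a minimal sketch using Python's built-in sqlite3 module. The users table, its columns, and the sample rows are made up for this example and are not from the course.

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# The schema: every row in the (hypothetical) users table has the same columns.
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)

# Store data.
conn.executemany(
    "INSERT INTO users (name, city) VALUES (?, ?)",
    [("Ada", "London"), ("Grace", "New York"), ("Alan", "London")],
)

# Manipulate data.
conn.execute("UPDATE users SET city = ? WHERE name = ?", ("Manchester", "Alan"))

# Query data.
london_users = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM users WHERE city = ? ORDER BY name", ("London",)
    )
]
print(london_users)
conn.close()
```

SQLite is used here only because it ships with Python. The same statements would look very similar in PostgreSQL or MySQL, though each database has its own dialect.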
Check the teacher's notes if you're
looking to learn more about SQL.
0:48
Relational databases perform very well for
data that is not sparse.
0:52
Sparsity in a database is defined
by the number of blank entries.
0:57
In the case of relational databases,
the less sparse the data, the better.
1:01
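One way to make sparsity concrete is to measure the fraction of blank cells in a table. The sketch below assumes that a blank entry is represented as None, which is a choice made for this example. The function name and sample rows are also hypothetical.

```python
def sparsity(rows):
    """Return the fraction of cells that are blank (None) across all rows."""
    total = sum(len(row) for row in rows)
    if total == 0:
        return 0.0
    blanks = sum(1 for row in rows for cell in row if cell is None)
    return blanks / total


# Every cell is filled in: not sparse at all.
dense_rows = [("Ada", "London", 36), ("Grace", "New York", 45)]

# Most cells are blank: a poor fit for a fixed set of columns.
sparse_rows = [("Ada", None, None), (None, "New York", None)]

print(sparsity(dense_rows))
print(sparsity(sparse_rows))
```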
They also do well with data that can
be contained on a single machine.
1:06
They become less appropriate when data
needs to be spread across many machines
1:10
and accessed in parallel.
1:14
There are databases built specifically for
1:16
these highly distributed purposes and
we'll cover those here shortly.
1:18
A few of the major relational
databases you've probably heard
1:22
of are PostgreSQL, MySQL,
MS SQL, and MariaDB.
1:27
Non-relational databases are often based
around documents, which you can think
1:31
of as a piece of data that doesn't have
a predefined schema, or structure.
1:35
Now these documents could be JSON, which
stands for JavaScript Object Notation,
1:40
XML, or just plain old text blobs.
1:45
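To show what "no predefined schema" can look like in practice, here is a small sketch using Python's built-in json module. The three documents are invented for this example. Each one carries its own set of fields, and the reading code has to tolerate fields that are missing.

```python
import json

# Three hypothetical JSON documents, each with a different shape.
raw_documents = [
    '{"name": "Ada", "city": "London"}',
    '{"name": "Grace", "languages": ["COBOL", "FORTRAN"]}',
    '{"title": "Meeting notes", "body": "plain old text"}',
]

documents = [json.loads(raw) for raw in raw_documents]

# There is no shared schema, so each document reports its own keys.
for doc in documents:
    print(sorted(doc.keys()))

# Reading a field means checking whether this document has it at all.
names = [doc["name"] for doc in documents if "name" in doc]
print(names)
```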
Non-relational databases, or NoSQL,
perform better when the data needs
1:47
to be distributed or
shared across many machines.
1:52
Distribution opens up the possibility of
having everything accessed in parallel
1:57
with the ability to read and
write in parallel across a cluster.
2:00
One of the most popular NoSQL databases
is MongoDB, a document-based NoSQL
2:05
database that stores data in BSON,
which is a binary format.
2:10
And then clients retrieve
the results in the form of JSON.
2:14
Remember that there's often more data
than can fit on a single computer.
2:18
When you need to scale the number of
machines where you're storing your data
2:22
to potentially thousands or more, you
have to use specialized storage systems.
2:26
These systems have the ability to
scale and have been battle tested and
2:31
can now store up to petabytes of data.
2:34
A few of the most popular storage
engines for large distributed data sets
2:37
are the Hadoop Distributed File System,
or HDFS, and Cassandra.
2:41
These are used for unstructured or
structured text data.
2:47
Amazon's Simple Storage Service,
more commonly referred to as Amazon S3,
2:51
is used to store files of nearly any size.
2:55
Hadoop was modeled on systems Google built to
index the entire web, like all of it.
2:59
Cassandra is a system originally built at Facebook
to power its inbox search.
3:04
Amazon S3 has been used by Dropbox and
many others for
3:09
storing files across many
regions of the world.
3:12
And last but not least,
we should discuss graph based databases.
3:15
Graph databases store data that can be
represented by nodes and edges, where
3:20
a node could be a person and an edge could
be a relationship that connects two nodes.
3:24
They help search and
walk relationships, and
3:30
find patterns in
the interconnectivity between nodes.
3:32
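The idea of walking relationships between nodes can be sketched in plain Python. This is not how Neo4j or any other graph database is actually queried. It is a minimal in-memory illustration, and the friends graph and the function name are made up for this example.

```python
from collections import deque

# Hypothetical social network: each node is a person, and each edge is a
# friendship, stored as an adjacency list.
friends = {
    "ana": ["ben", "cho"],
    "ben": ["ana", "dev"],
    "cho": ["ana"],
    "dev": ["ben", "eli"],
    "eli": ["dev"],
}


def degrees_of_separation(graph, start, goal):
    """Breadth-first search: edges on the shortest path, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        if person == goal:
            return distance
        for neighbor in graph.get(person, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


print(degrees_of_separation(friends, "ana", "eli"))
```

A graph database is designed to run this kind of traversal efficiently over far larger graphs than an in-memory dictionary can hold.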
The canonical example of a good use case
for a graph database is a social network.
3:36
It's important to keep in mind that you
don't want to use a graph database
3:41
just for
the sake of using a graph database.
3:45
It sounds cool, but often,
a normal SQL database will do the trick.
3:47
If you do find this is a good choice for
your data, Neo4j and Dgraph
3:52
are two very popular graph databases
that are open source and widely used.
3:56
Now that we've taken a brief overview
of the domain of data storage,
4:01
let's start looking at our next domain,
computation.
4:04