You have completed Data Analysis Basics!
It's not always so easy to collect good data. In this video we'll look at a few common issues with data collection and talk about how we could handle them.
We've got our data and we're ready to
start uncovering some hidden truths.
0:00
But first,
before we start analyzing anything,
0:04
it's usually a good idea to think
about where the data comes from.
0:07
The data we receive isn't always
guaranteed to be 100% accurate.
0:11
When analyzing the accuracy of your data,
you need to think about a few things.
0:16
Including the source of the information,
the methods for
0:20
collecting the data and
the way the data is measured.
0:23
Now, for this example,
our data comes from a pretty good source.
0:26
The Boston Athletic Association
keeps detailed records, and
0:30
they let you search through the results.
0:33
However, they don't just let
you download all the results.
0:36
That would be way too easy.
0:39
Instead, we rely on other data
scientists to write code to
0:41
repeatedly search through the results and
collect them all into a CSV file.
0:45
This is the first bit of messiness
that we need to be aware of.
0:50
Our data doesn't come
directly from the source.
0:53
We're relying on somebody else to collect
the data without making any mistakes.
0:56
So if we see anything strange,
1:01
like somebody running the whole marathon
in less than an hour, [SOUND] we'd want to
1:02
double check that with the official
results before continuing our analysis.
1:07
Or more likely, finding a new data source.
1:11
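One way to catch that kind of strange value before starting an analysis is a simple sanity check. This is only a minimal sketch: the one-hour threshold comes from the example above, while the `official_time` field name, the `H:MM:SS` format, and the sample rows are assumptions made up for illustration, not part of the real results file.

```python
from datetime import timedelta

# Assumed threshold for illustration: a whole marathon in under an hour
# is strange enough to double check against the official results.
MIN_PLAUSIBLE = timedelta(hours=1)

def parse_time(text):
    """Convert an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

def suspicious_rows(rows, time_field="official_time"):
    """Return the rows whose finish time is implausibly fast."""
    return [row for row in rows if parse_time(row[time_field]) < MIN_PLAUSIBLE]

# Made-up sample rows, not real results.
sample = [
    {"name": "Runner A", "official_time": "2:45:10"},
    {"name": "Runner B", "official_time": "0:52:03"},
]
flagged = suspicious_rows(sample)
```

Anything that ends up in `flagged` is a candidate for checking against the source, not proof of an error.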
Another example of messy data
would be survey results.
1:14
Unlike the Boston Marathon, where the
runners' times are automatically recorded,
1:18
with surveys,
1:22
we have to deal with the possibility
that respondents are lying to us.
1:23
Or, more likely, they're being
influenced by a cognitive bias.
1:27
One such cognitive bias is
the social desirability bias.
1:31
When answering a survey,
respondents have a tendency to
1:35
respond in a way that will
make them look good to others.
1:38
For example,
1:41
when the dentist asks how frequently
you floss, do you tell the truth?
1:42
Turns out, one in four of you don't.
1:47
Which leads to this awesome
headline from NPR:
1:49
Are you flossing, or
just lying about flossing?
1:52
But the social desirability bias doesn't
only mean over-reporting good behaviors.
1:56
It can also mean
under-reporting bad behaviors.
2:00
If a survey asks how frequently you use
recreational drugs or how many sexual
2:04
partners you had, it's pretty unlikely
that everyone's going to tell the truth.
2:08
In fact, getting accurate data
about recreational drug use is so
2:13
hard that scientists have turned
to testing wastewater to try and
2:17
figure out which substances
are being used in a community.
2:21
Luckily, we don't have to go quite that
far to make sure we have usable data.
2:25
Though just because our data
is automatically recorded
2:30
doesn't mean that it's accurate.
2:33
Take, for example, the step counter
on your phone or smartwatch.
2:35
I don't know about you, but
mine's not particularly accurate.
2:38
Sometimes I get in the car to go to work,
and
2:42
by the time I've arrived,
I've added another hundred steps.
2:44
Now for something like step counting,
2:48
it's probably not a big
deal to have a few extra.
2:50
But what if we had a step
competition with people using
2:53
all kinds of different devices
to record their steps?
2:56
All of a sudden, those extra
steps would matter a whole lot.
2:59
What if somebody else had a much
more accurate step counter?
3:04
And on that same drive,
3:07
instead of recording an extra hundred
steps, it stays perfectly accurate.
3:08
It wouldn't really be fair to
compare our steps directly.
3:13
So instead, before any analysis, we would
want to correct for any extra steps.
3:16
This process is known as cleaning or
preparing your data.
3:22
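As a rough sketch of what correcting for extra steps could look like, the example below subtracts an estimated overcount per car trip from each person's total. The device names, the per-drive overcount values, and the sample entries are all hypothetical; in practice you would have to measure how far off each device actually is.

```python
# Hypothetical per-device overcount: phantom steps recorded per car trip.
PHANTOM_STEPS_PER_DRIVE = {"phone_a": 100, "watch_b": 0}

def corrected_steps(raw_steps, device, drives):
    """Subtract the estimated phantom steps, never going below zero."""
    extra = PHANTOM_STEPS_PER_DRIVE.get(device, 0) * drives
    return max(raw_steps - extra, 0)

# Made-up competition entries for illustration.
entries = [
    {"person": "Me", "device": "phone_a", "raw_steps": 8200, "drives": 2},
    {"person": "Rival", "device": "watch_b", "raw_steps": 8050, "drives": 2},
]
for entry in entries:
    entry["steps"] = corrected_steps(
        entry["raw_steps"], entry["device"], entry["drives"]
    )
```

With these made-up numbers the raw totals put "Me" ahead, but the corrected totals do not, which is why the correction has to happen before any comparison.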
Before doing an analysis, you want to
make sure that your data is valid.
3:25
This can be as simple as combining
several misspellings of a name
3:29
into one category or
3:32
as difficult as trying to figure out which
responses are genuine in an online survey.
3:34
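The simple end of that range, combining several misspellings into one category, might be sketched like this. The mapping and the sample values are invented for illustration; a real list of misspellings would come from inspecting your own data.

```python
from collections import Counter

# Hypothetical mapping from known spellings to one canonical name.
CANONICAL = {
    "united states": "United States",
    "united sates": "United States",
    "untied states": "United States",
}

def normalize(name):
    """Collapse extra whitespace, then map known misspellings to one name."""
    cleaned = " ".join(name.split())
    return CANONICAL.get(cleaned.lower(), cleaned)

# Made-up raw values for illustration.
raw = ["United States", "united sates", " Untied  States ", "Kenya"]
counts = Counter(normalize(value) for value in raw)
```

Names that are not in the mapping pass through unchanged apart from whitespace, so nothing is silently merged that you did not explicitly list.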
Lucky for us,
our data is already pretty clean.
3:40
But we do have a few unused columns.
3:43
The first column seems to be
an unnecessary line number.
3:46
Column J is just completely empty, and
3:49
the Projected Time column is empty for
all runners.
3:52
We can safely delete each of these columns
by right-clicking on the column header and
3:55
selecting Delete column.
4:00
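If you were working in code instead of a spreadsheet, the same cleanup could be sketched with Python's standard `csv` module. The column names and sample rows below are assumptions standing in for the real file, which is not reproduced here.

```python
import csv
import io

def empty_columns(rows):
    """Names of columns whose value is blank in every row."""
    if not rows:
        return set()
    return {
        name
        for name in rows[0]
        if all(not (row[name] or "").strip() for row in rows)
    }

def drop_columns(rows, unwanted):
    """Return copies of the rows without the named columns."""
    return [
        {key: value for key, value in row.items() if key not in unwanted}
        for row in rows
    ]

# Made-up sample standing in for the real CSV; column names are assumptions.
sample_csv = io.StringIO(
    "line,name,unused,projected_time,official_time\n"
    "0,Runner A,,,2:45:10\n"
    "1,Runner B,,,3:10:42\n"
)
rows = list(csv.DictReader(sample_csv))

# Drop the line-number column plus every column that is empty for all rows.
cleaned = drop_columns(rows, empty_columns(rows) | {"line"})
```

Detecting the empty columns instead of hard-coding them means the same sketch still works if the file turns out to have other blank columns.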
Data cleaning and preparation is the real
unsung hero of data analytics.
4:02
It takes a lot of work to turn
raw information into good,
4:07
valid data that we can analyze.
4:10
And I can't stress enough
how important it is to
4:12
understand where your data comes from.
4:15
That's enough about
dealing with messy data.
4:18
Coming up,
we'll finally start doing some analysis.
4:20
And by the end of this course,
4:23
you'll be ready to draw insights
from all the data all around you.
4:24