Scenario: You're an analyst for The Economist.
Your editor pairs you up with a journalist to
write a piece about the perception of corruption
and its relationship to the human development
index. It is your job to inform the analytic
direction of the article and to provide the data
visualizations that will accompany the article.
Statistics: Unlocking the Power of Data Lock5

What Information is Available?
What can you all tell me about the data below?
⚫ CPI - Corruption Perception Index
⚫ HDI - Human Development Index
⚫ HDI.Rank - Low HDI implies Low Rank in the HDI

As the analyst
⚫ How do you begin analyzing the data?
⚫ Which questions can be answered with the data?
⚫ How would you answer those questions?
⚫ How do you know your results are valid and meaningful?
⚫ How would you explain the results to:
⚪ A technical peer?
⚪ A non-technical stakeholder?

What Can We Do?
Which questions would you like to answer based on the data you have seen?
Examples
⚫ Is there a relationship between the human development index and the corruption perception index?
- Hypothesis Testing
- Modeling
⚫ Does HDI vary by region?
- In which regions does it vary?
- How much does it vary?

Do the Results Mean Anything?
- What is a p-value and what does it mean?
- Are my standard errors accurate?
Course Expectations
Visualizations and Intro to Modeling
⚫ Assess and explore a new dataset.
⚫ Determine appropriate questions.
⚫ Differentiate between the questions you want to answer and the questions
you can answer.
⚫ Determine which statistical methods can be applied to the available data.
⚫ Interpret the results of a given statistical method.
⚫ Understand the limitations of interpretation.
⚫ Explain your analysis and choices to others.
⚫ Gain data literacy skills to help you in the future.
Outline
Section 1.1 The Structure of Data
⚫ Data
⚫ Cases and variables
⚫ Categorical and quantitative variables
⚫ Explanatory and response variables

Why Statistics?
Statistics is all about data
⚪ Collecting data
⚪ Describing data – summarizing, visualizing
⚪ Analyzing data
⚫ Using data to answer a question
Data are everywhere! Regardless of your field, interests, lifestyle, etc., you will almost definitely have to make decisions based on data.

Data
⚫ Data are a set of measurements taken on a set of individual units
⚫ Usually data is stored and presented in a dataset, comprised of variables measured on cases

Intro Statistics Survey Data

The Economist Data

Cases and Variables
We obtain information about cases or units.
A variable is any characteristic that is recorded for each case.
⚫ Generally each case makes up a row in a dataset, and each variable makes up a column
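The row-and-column layout can be sketched in a few lines of Python. This is only an illustration: the country names, regions, and numbers below are made up and are not taken from The Economist data. Each dictionary is one case and each key is one variable.

```python
# Hypothetical toy dataset: each dict is one case (a country),
# each key is one variable recorded for that case.
# All values are invented for illustration only.
countries = [
    {"country": "Country A", "region": "Region 1", "cpi": 7.1, "hdi": 0.90},
    {"country": "Country B", "region": "Region 2", "cpi": 3.4, "hdi": 0.62},
    {"country": "Country C", "region": "Region 1", "cpi": 5.8, "hdi": 0.81},
]

n_cases = len(countries)               # number of rows
variables = list(countries[0].keys())  # column names

# Pull out one variable (a column) across all cases (rows).
hdi_column = [row["hdi"] for row in countries]

print(f"{n_cases} cases, {len(variables)} variables: {variables}")
print("hdi column:", hdi_column)
```

Reading across one dictionary gives everything recorded for a single case. Reading one key across all dictionaries gives a single variable.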
Kidney Cancer
Counties with the highest kidney cancer death rates
Source: Gelman et al., Bayesian Data Analysis, CRC Press, 2004.

Kidney Cancer
If the values in the kidney cancer dataset are rates of kidney cancer deaths, then what are the cases?
(a) The people living in the US
(b) The counties of the US
A person either has kidney cancer or doesn’t… a rate must apply to a group of people, such as a county

Kidney Cancer
If the values in the kidney cancer dataset are yes/no, then what are the cases?
(a) The people living in the US
(b) The counties of the US
A person either has kidney cancer or doesn’t. Yes/no doesn’t make sense for a county.
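The difference between the two versions of the dataset can be sketched as follows. The records are hypothetical (invented county names and outcomes). Person-level yes/no values are aggregated into one rate per county, which changes what a case is.

```python
from collections import defaultdict

# Hypothetical person-level records: (county, has_kidney_cancer).
# Here each case is a person and the variable is yes/no.
people = [
    ("County X", False), ("County X", True), ("County X", False),
    ("County X", False), ("County Y", False), ("County Y", False),
]

# county -> [number of "yes" values, number of people]
counts = defaultdict(lambda: [0, 0])
for county, has_cancer in people:
    counts[county][0] += int(has_cancer)
    counts[county][1] += 1

# Aggregating turns the yes/no values into one rate per county,
# so in the new dataset each case is a county.
county_rates = {c: yes / total for c, (yes, total) in counts.items()}
print(county_rates)
```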
Categorical versus Quantitative
⚫ Variables are classified as either categorical or quantitative:
• A categorical variable divides the cases into groups
• A quantitative variable measures a numerical quantity for each case

Hollywood Movies
Do movies that are comedies tend to get higher audience ratings than movies that are dramas?
In a dataset to answer this question, what are the cases?
(a) Comedies
(b) Dramas
(c) Movies
(d) Audience ratings
We are collecting data about movies, so the cases are the movies.

Hollywood Movies
Do movies that are comedies tend to get higher audience ratings than movies that are dramas?
In a dataset to answer this question, how many variables are there?
(a) 1
(b) 2
(c) 3
(d) 4
There are two variables: whether the movie is a comedy or a drama, and the audience rating.

Hollywood Movies
Do movies that are comedies tend to get higher audience ratings than movies that are dramas?
In a dataset to answer this question, how many of the variables are quantitative?
(a) 0
(b) 1
(c) 2
Whether the movie is a comedy or a drama is categorical. Audience rating is quantitative.

Explanatory and Response
⚫ Sometimes we are interested in one variable. Other times we are interested in the relationship between two variables
If we are using one variable to help us understand or predict values of another variable, we call the former the explanatory variable and the latter the response variable

Variables
For each of the following situations:
⚪ What are the variables?
⚪ Is each variable categorical or quantitative?
⚪ Which is the explanatory and which is the response variable?
⚫ Does meditation help reduce stress?
⚫ Does sugar consumption increase hyperactivity?
2. Can eating a yogurt a day cause you to lose weight?
3. Do males find females more attractive if they wear red?
4. Does louder music cause people to drink more beer?
5. Are lions more likely to attack after a full moon?

What do you want to know?
We’ll do a class survey, collecting data you are interested in.
⚫ What do you want to know about your peers?
⚫ What are the variables? (one or two?)
⚫ Are they categorical or quantitative?
⚫ Is there an explanatory and response variable?

What do you want to know?
⚪ Write a question to measure each variable of interest. Write questions so the resulting data will be accurate and easy to analyze.
⚫ Quantitative variable? Give units.
⚫ Categorical variable? Give the possible categories (no more than 5).
⚫ Be clear and specific.

Summary
⚫ Data are everywhere, and pertain to a wide variety of topics
⚫ A dataset is usually comprised of variables measured on cases
⚫ Variables are either categorical or quantitative
⚫ Data can be used to provide information about essentially anything we are interested in and want to collect data on!

Outline
Section 1.2 Sampling from a Population
⚫ Sample versus Population
⚫ Statistical Inference
⚫ Sampling Bias
⚫ Simple Random Sample
⚫ Other Sources of Bias

Sample versus Population
A population includes all individuals or objects of interest.
A sample is all the cases that we have collected data on (a subset of the population).
Statistical inference is the process of using data from a sample to gain information about the population.

The Big Picture
[Diagram: Population, Sampling, Sample, Statistical Inference]
Most Important to You
Which of the following is most important to you?
a) Athletics
b) Academics
c) Social Life
d) Community Service
e) Other

Most Important to You
⚫ Suppose researchers studying student life use the results of our clicker question to investigate what students find important
⚫ What is the population?
⚫ Can the sample data be generalized to make inferences about the population? Why or why not?

Dewey Defeats Truman?
⚫ … of the 1948 presidential election, and was based on the results of a large telephone poll
⚫ However, Harry S. Truman won the election
⚫ What went wrong?
Sampling Bias
Sampling bias occurs when the method of selecting a sample causes the sample to differ from the population in some relevant way.
⚫ If sampling bias exists, we cannot trust generalizations from the sample to the population
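A small simulation can make the definition concrete. The sketch below assumes a synthetic population of weekly study hours (the numbers are generated, not real data) and a hypothetical selection rule in which students who study more are more likely to end up in the sample. It compares that rule with random selection.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic population (made up): weekly study hours for 10,000 students.
population = [max(0.0, rng.gauss(15, 6)) for _ in range(10_000)]
true_mean = statistics.mean(population)

# Biased selection: the chance of being picked grows with hours studied.
weights = [1 + hours for hours in population]

def biased_sample(n):
    # random.choices draws with replacement, which is fine for illustration.
    return rng.choices(population, weights=weights, k=n)

def random_sample(n):
    # random.sample gives every student the same chance of selection.
    return rng.sample(population, n)

biased_means = [statistics.mean(biased_sample(50)) for _ in range(200)]
random_means = [statistics.mean(random_sample(50)) for _ in range(200)]

print(f"population mean:            {true_mean:.2f}")
print(f"average of biased samples:  {statistics.mean(biased_means):.2f}")
print(f"average of random samples:  {statistics.mean(random_means):.2f}")
```

Under these assumptions the biased samples overestimate the population mean on average, while the random samples center on it.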
Can you avoid sampling bias?
⚫ The next slide shows Lincoln’s Gettysburg Address. The entire population, all words in his address, will be shown to you. What is the average word length?
⚫ Your task: Select a sample of 10 words that resemble the overall address. Write them down.
⚫ Calculate the average number of letters for the words in your sample
⚫ Enter your 10 random words into this sheet (paste in Zoom chat)

Lincoln’s Gettysburg Address
“Four score and seven years ago our fathers brought forth, on this continent, a new nation, conceived in
Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great
civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We
are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final
resting place for those who here gave their lives that that nation might live. It is altogether fitting and
proper that we should do this. But, in a larger sense, we can not dedicate—we can not consecrate—we
can not hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it, far
above our poor power to add or detract. The world will little note, nor long remember what we say here,
but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the
unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be
here dedicated to the great task remaining before us—that from these honored dead we take increased
devotion to that cause for which they here gave the last full measure of devotion—that we here highly
resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of
freedom—and that government of the people, by the people, for the people, shall not perish from the
earth.”
GOAL: Select a sample that is similar to the population, only smaller
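The exercise can also be checked in code. The sketch below computes the average word length of the address as printed above and compares it with the averages of many random samples of 10 words. Two choices here are assumptions: a hyphenated term such as "battle-field" is counted as one word, and only letters count toward a word's length.

```python
import random
import re
import statistics

# Text of the address as shown above; dashes are written as " -- ".
ADDRESS = """
Four score and seven years ago our fathers brought forth, on this continent, a new nation,
conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we
are engaged in a great civil war, testing whether that nation, or any nation so conceived and
so dedicated, can long endure. We are met on a great battle-field of that war. We have come to
dedicate a portion of that field, as a final resting place for those who here gave their lives
that that nation might live. It is altogether fitting and proper that we should do this. But,
in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this
ground. The brave men, living and dead, who struggled here, have consecrated it, far above our
poor power to add or detract. The world will little note, nor long remember what we say here,
but it can never forget what they did here. It is for us the living, rather, to be dedicated
here to the unfinished work which they who fought here have thus far so nobly advanced. It is
rather for us to be here dedicated to the great task remaining before us -- that from these
honored dead we take increased devotion to that cause for which they here gave the last full
measure of devotion -- that we here highly resolve that these dead shall not have died in
vain -- that this nation, under God, shall have a new birth of freedom -- and that government
of the people, by the people, for the people, shall not perish from the earth.
"""

# Assumption: a hyphenated term counts as one word; only letters are counted.
words = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", ADDRESS)
lengths = [len(w.replace("-", "")) for w in words]
population_mean = statistics.mean(lengths)

# Draw many simple random samples of 10 words and average each one.
rng = random.Random(1)  # fixed seed for reproducibility
sample_means = [statistics.mean(rng.sample(lengths, 10)) for _ in range(1000)]

print(f"{len(words)} words, population average length {population_mean:.2f}")
print(f"average of 1000 random-sample means: {statistics.mean(sample_means):.2f}")
```

Comparing your hand-picked sample average with `population_mean` shows how far off it is. The random-sample averages should center close to the population value.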
Can you avoid sampling bias?
⚫ Actual average??
⚫ We need a better way…

How can we make sure to avoid sampling bias?
⚫ Imagine putting the names of all the units of the population into a hat, and drawing out names at random to be in the sample
⚫ More often, we use technology
⚫ Before the 2008 election, the Gallup Poll took a random sample of 2,847 Americans. 52% of those sampled supported Obama
⚫ In the actual election, 53% voted for Obama
⚫ Random sampling is a very powerful tool!!!

Random vs Non-Random Sampling
⚫ Random samples have averages that are centered around the correct number
⚫ Non-random samples may suffer from sampling bias, and averages may not be centered around the correct number
⚫ Only random samples can truly be trusted when making generalizations to the population

Simple Random Sample
In a simple random sample, each unit of the population has the same chance of being selected, regardless of the other units chosen for the sample
⚫ More complicated random sampling schemes exist, but will not be covered in this course
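Drawing a simple random sample with technology can be sketched with Python's standard `random.sample`, which draws without replacement and gives every unit the same chance of selection. The population below is synthetic: its size is an assumption chosen for illustration, while the 53% share and the sample size of 2,847 come from the Gallup example.

```python
import random

rng = random.Random(2008)  # fixed seed for reproducibility

POPULATION_SIZE = 1_000_000  # assumed size, chosen only for illustration
TRUE_SUPPORT = 0.53          # share of the population supporting the candidate
SAMPLE_SIZE = 2_847          # sample size quoted for the Gallup poll

# 1 = supports the candidate, 0 = does not.
n_supporters = int(POPULATION_SIZE * TRUE_SUPPORT)
population = [1] * n_supporters + [0] * (POPULATION_SIZE - n_supporters)

# Simple random sample: every unit is equally likely to be chosen,
# regardless of which other units are chosen.
sample = rng.sample(population, SAMPLE_SIZE)
sample_share = sum(sample) / SAMPLE_SIZE

print(f"population share: {TRUE_SUPPORT:.3f}")
print(f"sample share:     {sample_share:.3f}")
```

Even though the sample is a tiny fraction of the population, the sample share typically lands within a couple of percentage points of the population share.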
Random Sampling
Take a RANDOM sample!
⚫ People are TERRIBLE at selecting a good sample, even when explicitly trying to avoid sampling bias!

Realities of Sampling
⚫ While a random sample is ideal, often it isn’t feasible. A list of the entire population may not be available, or it may be impossible or too difficult to contact all members of the population.
⚫ Sometimes, your population of interest has to be altered to something more feasible to sample from. Generalization of results is limited to the population that was actually sampled from.
⚫ In practice, think hard about potential sources of sampling bias, and try your best to avoid them

Non-Random Samples
Suppose you want to estimate the average number of hours that students spend studying each week. Which of the following is the best method of sampling?
(a) Go to the library and ask all the students there how much they study
(b) Email all students asking how much they study, and use all the data you get
(c) Give a clicker question in this class and force every student to respond
(d) Stand outside the student center and ask everyone going in how much they study
All are flawed!

Bad Methods of Sampling
⚫ Letting your sample be comprised of whoever chooses to participate (volunteer bias)
⚫ People who choose to participate or respond are probably not representative of the entire population
⚪ Emailing or mailing the entire population, and then making conclusions about the population based on whoever chooses to respond
⚪ Example: An airline emails all of its customers asking them to rate their satisfaction with their recent travel

Bad Methods of Sampling
⚫ Sampling units based on something obviously related to the variable(s) you are studying
⚪ Sampling only students in the library when asking how much they study, or sampling only students taking a statistics class
⚪ “Today’s Poll” on fitnessmagazine.com asked “Have you ever hired a personal trainer?” 27% of respondents said “yes”. Can we infer that 27% of all humans have hired a personal trainer?

Alcohol, Marijuana, and Driving
⚫ The Federal Office of Road Safety in Australia conducted a study on the effects of alcohol and marijuana on performance
⚫ Volunteers who responded to advertisements for the study on rock radio stations were given a random combination of the two drugs, then their performance was observed
⚪ What is the sample? What is the population?
⚪ Is there sampling bias?
⚪ Will the results be informative and/or do you think the study is worth conducting?
Source: Chesher, G., Dauncey, H., Crawford, J. and Horn, K, “The Interaction between Alcohol and Marijuana: A Dose Dependent Study on the Effects of Human Moods and Performance Skills,” Report No. C40, Federal Office of Road Safety, Federal Department of Transport, Australia, 1986.

Data Collection and Bias
[Diagram: Population, Sample, DATA, with the labels “Sampling Bias?” and “Other forms of bias?”]

Other Forms of Bias
⚫ Even with a random sample, data can still be biased, especially when collected on humans
⚫ Other forms of bias to watch out for in data collection:
⚪ Question wording
⚪ Context
⚪ Inaccurate responses
⚪ Many other possibilities – examine the specifics of each study!

Question Wording
⚫ “Do you think the US should allow public speeches against democracy?” 21% said speeches should be allowed
⚫ “Do you think the US should not forbid public speeches against democracy?” 39% said speeches should not be forbidden
Source: Rugg, D. (1941). “Experiments in wording questions,” Public Opinion Quarterly, 5, 91-92.

Question Wording
A random sample was asked: “Should there be a tax cut, or should money be used to fund new government programs?”
Tax Cut: 60%
Programs: 40%
A different random sample was asked: “Should there be a tax cut, or should money be spent on programs for education, the environment, health care, crime-fighting, and military defense?”
Tax Cut: 22%

Context
Having Children
If we were to run the question all by itself in the newspaper with a request for responses...
