The results of this coursework assignment should be combined into a single PDF report. The R code written for the coursework, as well as the output of the analyses, should be included as an appendix.
Part 1 – Design and set-up of true experiment
Write a plan for conducting an experiment on a group of human test subjects. As a group you are allowed to select your own topic for this experiment. The plan should include the following items.
The motivation for the planned research. (Max 250 words)
The theory underlying the research, preferably based on theories reported in the literature. (Max 250 words)
Research questions that will be examined in the experiment (or alternatively the hypothesis that will be tested in the experiment)
The related conceptual model, this model should include:
o Independent variable(s)
o Dependent variable
o Mediating variable (at least 1)
o Moderating variable (at least 1)
Experimental Design (the study should have a true experimental design)
Experimental procedure (how the experiment will be executed step by step)
Measures
Participants
Suggested statistical analyses
Part 2 – Generalized linear models
Question 1 – Twitter sentiment analysis (between groups – single factor)
By analyzing tweets about a specific topic or person, it is possible to get an overall sense of the sentiment of these tweets. This is done by counting the number of positive and negative words in a tweet. The main aim of this question is to compare the sentiment of the tweets related to at least 3 famous individuals (i.e. celebrities) who are often the topic of discussion on Twitter (in English). The file “TwitterAnalysis.R” shows how you can obtain tweets automatically. This program uses the following files, which you need to place in your working directory: sentiment3.R, negative-words.txt, and positive-words.txt.
For the analysis you need a Twitter account to create a so-called “Twitter app” on apps.twitter.com. Once you have done this, obtain the information under “Keys and Access Tokens” and enter it in your own file with your personal Twitter variables. For this you can use the template file “your_twitter.R”.
Once you have done this, conduct the following analyses on the obtained data set.
Make a conceptual model for the following research question: Is there a difference in the sentiment of the tweets related to the different celebrities?
Analyze the homogeneity of variance of sentiments of the tweets of the different celebrities
Graphically examine the variation in tweets’ sentiments for each celebrity (e.g. histogram, density plot etc.)
Graphically examine the mean sentiments of tweets for each celebrity
Use a linear model to analyze whether knowing which celebrity a tweet relates to significantly helps to explain the sentiment of the tweets.
If a model that includes the celebrity explains the sentiments of tweets better than a model without this predictor, conduct a post-hoc analysis, e.g. with Bonferroni correction, to examine which celebrities' tweets differ from those of the other celebrities.
Write a small section for a scientific publication, in which you report the results of the analyses of points 2-6, and explain the conclusions that can be drawn.
Include the annotated R script (excluding your personal Keys and Access Tokens information) in the appendix of the report.
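The analysis steps of this question could be sketched roughly as follows. This is only an illustration on simulated data, not the output of TwitterAnalysis.R: the data frame `tweets`, its columns `celebrity` and `sentiment`, and the labels A, B and C are hypothetical stand-ins for the sentiment scores you collect. Bartlett's test is used because it is part of base R; Levene's test is a common alternative.

```r
# Simulated stand-in for the collected tweets (hypothetical names and values)
set.seed(1)
tweets <- data.frame(
  celebrity = factor(rep(c("A", "B", "C"), each = 100)),
  sentiment = c(rnorm(100, mean = 0), rnorm(100, mean = 0.2),
                rnorm(100, mean = 1.5))
)

# Homogeneity of variance across celebrities
variance_test <- bartlett.test(sentiment ~ celebrity, data = tweets)

# Variation in sentiment per celebrity: one histogram per group
par(mfrow = c(1, 3))
for (level in levels(tweets$celebrity)) {
  hist(tweets$sentiment[tweets$celebrity == level],
       main = level, xlab = "Sentiment score")
}
par(mfrow = c(1, 1))

# Mean sentiment per celebrity
celebrity_means <- tapply(tweets$sentiment, tweets$celebrity, mean)
barplot(celebrity_means, ylab = "Mean sentiment")

# Model without and with celebrity as predictor, compared with an F test
model0 <- lm(sentiment ~ 1, data = tweets)
model1 <- lm(sentiment ~ celebrity, data = tweets)
model_comparison <- anova(model0, model1)

# Post-hoc pairwise comparisons with Bonferroni correction
posthoc <- pairwise.t.test(tweets$sentiment, tweets$celebrity,
                           p.adjust.method = "bonferroni")

print(variance_test)
print(model_comparison)
print(posthoc)
```

With real tweets the sentiment scores are counts of words and may be far from normal, so the choice of tests should follow from the graphical inspection.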
Question 2 – Website visits (between groups – Two factors)
For this question you have to use the data file webvisit[x].csv. There are 3 versions of this data set (0,1, and 2). To determine the version your group has to select, add up the age (in years, at the first official day of the course) of the group members and take modulo 3 of this number. The obtained number is the version your group has to complete.
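As a small worked example of the version rule, assuming a group of three members with hypothetical ages, and assuming that [x] in the file name is simply replaced by the resulting number:

```r
# Hypothetical ages (in years) of three group members
ages <- c(22, 23, 25)

# Sum of the ages modulo 3: 70 %% 3 is 1
version <- sum(ages) %% 3

# Assumed naming pattern: [x] is replaced by the version number
file_name <- paste0("webvisit", version, ".csv")
print(file_name)
```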
The file represents data obtained from a webserver of a company X. The company runs an A-B study to test two versions of their website (0 = old, 1 = new version). The company targets two markets and therefore has two web portal entries (0 = consumers, 1 = companies). For each visit to their website, the data file shows the number of pages the visitor visited. The aim of the analysis is to examine whether the version of the website, the portal, or the combination of the two had an impact on the number of pages visited.
Make a conceptual model underlying this research question
Graphically examine the variation in page visits for different factors levels (e.g. histogram, density plot etc.)
Statistically test if variable page visits deviates from normal distribution
Conduct a model analysis to examine the added value of including the 2 factors and the interaction between them in the model to predict page visits.
If the analysis shows a significant two-way interaction effect, conduct a Simple Effect analysis to explore this interaction effect in more detail.
Write a small section for a scientific publication, in which you report the results of the analyses of points 2-6, and explain the conclusions that can be drawn.
Include annotated R script in the appendix of the report.
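One possible way to set up these steps is sketched below on simulated data. The column names `version`, `portal` and `pages` are assumptions and may not match the headers in the webvisit file. The Poisson family is likewise an assumption, chosen here because page visits are counts; the assignment does not prescribe it.

```r
# Simulated stand-in for the webvisit data (hypothetical column names)
set.seed(2)
n <- 400
visits <- data.frame(
  version = factor(rep(c(0, 1), times = n / 2)),
  portal  = factor(rep(c(0, 1), each = n / 2))
)
# Simulated counts: the new version only helps on the company portal
rate <- exp(1 + 0.8 * (visits$version == "1") * (visits$portal == "1"))
visits$pages <- rpois(n, lambda = rate)

# Variation in page visits per combination of factor levels
boxplot(pages ~ version + portal, data = visits, ylab = "Pages visited")

# Test whether page visits deviate from a normal distribution
normality <- shapiro.test(visits$pages)

# Nested models: null, two main effects, main effects plus interaction
m0 <- glm(pages ~ 1, family = poisson, data = visits)
m1 <- glm(pages ~ version + portal, family = poisson, data = visits)
m2 <- glm(pages ~ version * portal, family = poisson, data = visits)
comparison <- anova(m0, m1, m2, test = "Chisq")

# Simple effects: effect of version within each portal separately
simple_effects <- lapply(split(visits, visits$portal), function(d) {
  summary(glm(pages ~ version, family = poisson, data = d))$coefficients
})

print(normality)
print(comparison)
print(simple_effects)
```

Fitting the model separately per portal is only one way to examine simple effects; contrasts within the full model are an alternative.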
Question 3 – Linear regression analysis
Select a data set from the following sites and conduct a linear regression:
http://www.stat.ufl.edu/~winner/datasets.html
http://support.minitab.com/en-us/datasets/ (install the trial version of Minitab to open the Minitab file, copy the data to an Excel file and open this in R)
Dutch CBS Statistics Netherlands http://statline.cbs.nl/Statweb/?LA=en
The data you select should meet the following requirements:
n ≥ 100
at least 3 independent variables of interval (or ratio) level
a dependent variable of interval (or ratio) level that is reasonably normally distributed
Independence of observations
Conduct the following analysis on the data set:
Make a conceptual model underlying this research question
Graphical analysis of the distribution of the dependent variable, e.g. histogram, density plot
Scatter plots between dependent variable and the predictor variables
Conduct a multiple linear regression (including confidence intervals, and beta-values)
Examine the assumptions underlying linear regression, e.g. collinearity and analyses of the residuals (normal distribution via a QQ plot, the linearity assumption, the homogeneity of variance assumption). Where possible, support the examination with visual inspection.
Examine effect of single cases on the predicted values (e.g. DFBeta, Cook’s distance)
Write a small section for a scientific publication, in which you explain the data set examined, report the results of the analyses of points 2-6, and explain the conclusions that can be drawn.
Include annotated R script in the appendix of the report.
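The regression steps listed above might look roughly like this in base R. The data are simulated and the variable names `y`, `x1`, `x2` and `x3` are hypothetical; with a real data set they are replaced by the selected variables. The variance inflation factor is computed by hand as 1 / (1 - R²) to keep the sketch free of extra packages.

```r
# Simulated data set with three predictors (hypothetical names and values)
set.seed(3)
n <- 120
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 + 1.0 * d$x1 - 0.5 * d$x2 + 0.3 * d$x3 + rnorm(n)

# Distribution of the dependent variable and pairwise scatter plots
hist(d$y, main = "Dependent variable", xlab = "y")
pairs(d)

# Multiple linear regression with 95% confidence intervals
fit <- lm(y ~ x1 + x2 + x3, data = d)
conf_int <- confint(fit)

# Standardized beta values: refit the model on z-scored variables
fit_std <- lm(y ~ x1 + x2 + x3, data = as.data.frame(scale(d)))
betas <- coef(fit_std)[-1]

# Collinearity: regress each predictor on the others, VIF = 1 / (1 - R^2)
predictors <- c("x1", "x2", "x3")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = d))$r.squared
  1 / (1 - r2)
})

# Residual checks: normality (QQ plot), linearity and equal variance
qqnorm(resid(fit)); qqline(resid(fit))
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")

# Influence of single cases
cooks <- cooks.distance(fit)
dfb <- dfbeta(fit)

print(summary(fit))
print(conf_int)
print(betas)
print(vif)
print(head(sort(cooks, decreasing = TRUE)))
```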
Question 4 – Logistic regression analysis
Select a data set from the following sites and conduct a logistic regression:
http://www.stat.ufl.edu/~winner/datasets.html
http://support.minitab.com/en-us/datasets/ (use the foreign package and its read.mtp function to read Minitab files)
Dutch CBS Statistics Netherlands http://statline.cbs.nl/Statweb/?LA=en
The data you select should meet the following requirements:
n ≥ 50
at least two independent variables
a dichotomous dependent variable
Independence of observations
Conduct the following analysis on the data set:
Make a conceptual model underlying this research question
Conduct a logistic regression and examine whether adding individual indicators to the model improves it compared to the null model. Make a final model with only the significant predictor(s). For this model, calculate the pseudo R-square. Calculate the odds ratios for the predictors and their confidence intervals.
Make a crosstable of the predicted and observed response
Write a small section for a scientific publication, in which you explain the data set examined, report the results of the analyses of points 2 and 3, and explain the conclusions that can be drawn.
Include annotated R script in the appendix of the report.
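A minimal sketch of the logistic regression steps, again on simulated data with hypothetical variable names (`outcome`, `x1`, `x2`). The pseudo R-square shown is McFadden's, which is one of several variants, and the confidence intervals are Wald intervals; both choices are assumptions rather than requirements of the assignment.

```r
# Simulated data: x1 affects the outcome, x2 has no true effect
set.seed(4)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$outcome <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * d$x1))

# Null model versus model with both predictors
null_model <- glm(outcome ~ 1, family = binomial, data = d)
full_model <- glm(outcome ~ x1 + x2, family = binomial, data = d)
print(anova(null_model, full_model, test = "Chisq"))
print(summary(full_model))

# Final model: here we suppose that only x1 turned out significant
final_model <- glm(outcome ~ x1, family = binomial, data = d)

# McFadden pseudo R-square: 1 - logLik(model) / logLik(null model)
pseudo_r2 <- 1 - as.numeric(logLik(final_model)) /
  as.numeric(logLik(null_model))

# Odds ratios with Wald 95% confidence intervals
odds_ratios <- exp(cbind(OR = coef(final_model),
                         confint.default(final_model)))

# Cross table of predicted (cut-off 0.5) and observed response
predicted <- as.integer(fitted(final_model) > 0.5)
cross_table <- table(predicted = predicted, observed = d$outcome)

print(pseudo_r2)
print(odds_ratios)
print(cross_table)
```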
Part 3 – Multilevel model
For this part of the assignment you need to use the file set[x].csv. To determine the version your group has to select, add up the student ID numbers of the group members and take modulo 3 of this number. The file includes longitudinal data collected from a large group of participants (Subjects) who, in multiple sessions (session), completed a learning exercise for which an exercise score (score) was collected. Note that the number of exercises completed varies between participants. Conduct a multilevel analysis to see whether the exercise score varies systematically over sessions. Besides a baseline model, create a model that includes session as a fixed factor and uses a random intercept for the participants. Give an interpretation of the results and report the statistical results in a small paragraph for a scientific publication.
Conduct the following analysis
Use graphics to inspect the distribution of the score, and relationship between session and score
Conduct multilevel analysis and calculate 95% confidence intervals, determine:
If session has an impact on people's scores
If there is significant variance between the participants in their score
Write a small section for a scientific publication, in which you explain the data set examined, report the results of the analyses of points 2 and 3, and explain the conclusions that can be drawn.
Include annotated R script in the appendix of the report
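The multilevel analysis could be approached along the following lines. The sketch uses the nlme package and simulated data; the column names `Subjects`, `session` and `score` are taken from the description above but may differ from the actual file headers, and all simulated values are hypothetical. Models are fitted with maximum likelihood so that they can be compared with likelihood ratio tests.

```r
library(nlme)  # recommended package shipped with standard R installations

# Simulated longitudinal data, unequal number of sessions per participant
set.seed(5)
n_subjects <- 40
sessions_per_subject <- sample(3:8, n_subjects, replace = TRUE)
d <- data.frame(
  Subjects = factor(rep(seq_len(n_subjects), times = sessions_per_subject)),
  session  = sequence(sessions_per_subject)
)
subject_intercepts <- rnorm(n_subjects, sd = 5)
d$score <- 50 + 2 * d$session +
  subject_intercepts[as.integer(d$Subjects)] + rnorm(nrow(d), sd = 3)

# Distribution of score and relationship between session and score
hist(d$score, main = "Exercise score", xlab = "score")
plot(score ~ session, data = d)

# Model without random effects, baseline with random intercept,
# and model with session as fixed factor
no_random    <- gls(score ~ 1, data = d, method = "ML")
baseline     <- lme(score ~ 1, random = ~ 1 | Subjects,
                    data = d, method = "ML")
with_session <- lme(score ~ session, random = ~ 1 | Subjects,
                    data = d, method = "ML")

# Is there significant variance between participants?
variance_check <- anova(no_random, baseline)

# Does session have an impact on the score?
session_check <- anova(baseline, with_session)

# 95% confidence intervals for the fixed effects
ci_fixed <- intervals(with_session, level = 0.95, which = "fixed")

print(variance_check)
print(session_check)
print(ci_fixed)
```

Calling `intervals` with `which = "var-cov"` instead gives intervals for the random-effect standard deviations, which can support the conclusion about between-participant variance.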