---
title: "Poisson Model on Argumentative markers - Parichehr Afzali 4 July 2021"
output:
  word_document:
    keep_md: yes
  html_document: default
  pdf_document: default
---
# Introduction
In the data for this study, occurrences of the word 'because' in texts by Norwegian writers of English are counted. The aim is to see how individual writers, and male and female writers as groups, use the word to mark the argument they put forward in a causal argumentation scheme to back up their claims in an argumentative text. The dataset presented here is simplified, so the models below address only one part of the study: the differences among individuals and between genders in how frequently 'because' is used in an argumentative text written in English. The plots highlight the differences among the Norwegian writers in their use of 'because'. The prediction is that Norwegian writers use 'because' frequently and that there is only a slight difference between genders when arguing for or against a subject.
```{r}
# Set the working directory that contains the data files:
setwd("C:/temp/Norwegian 14 Dec 22-30")
# Load tidyverse, lme4, and afex:
library(tidyverse)
library(lme4)
library(afex)
# Load the datasets:
markers <- read.csv("metadata.csv")
becauseNor <- read.csv("Because NO Copy.csv")
# Check the datasets:
markers
becauseNor
```
# Data set
The data file contains information about each writer's nationality, native language, foreign languages, and institution, as well as the title of the text, the timing condition, the examination condition, the number of years the writer has studied English, the timing of the text, and the use of reference tools. Most of these variables were deliberately kept similar so that the three sub-corpora, which represent writers from different language backgrounds, are comparable. The information that varies in this study is therefore the frequency of the word, the length of the text, and the gender of each individual. The following steps select the data needed for the plot and the linear model:
```{r}
# Rename the columns that will be used in the plot and the linear model:
becauseNor <- rename(becauseNor,
                     Freq = Center,
                     ID = File.name,
                     Length = Length.in.words)
becauseNor
# Keep only the columns that will be used in the linear model:
becauseNor <- select(becauseNor, ID, Freq, Length, Gender)
becauseNor
```
```{r}
# Count the number of times each individual has used 'because' in their texts:
Frequency <- becauseNor %>% count(ID)
Frequency
```
```{r}
# Rename the column:
Frequency <- rename(Frequency, Freq = n)
Frequency
```
```{r}
# Left join the dataset containing the counts with the dataset containing length of the texts and gender of the writers:
left_join(Frequency,becauseNor, by = "ID")
```
```{r}
# There is no need to pivot this data frame longer, because each row already holds one observation.
# Assign the joined data to a new object:
FreqBecauseNOR <- left_join(Frequency, becauseNor, by = "ID")
FreqBecauseNOR
```
```{r}
# Rename the Freq.x column (the per-writer count) back to n:
FreqBecauseNOR <- rename(FreqBecauseNOR, n = Freq.x)
FreqBecauseNOR
# Remove the repeated rows so that each writer appears once:
FreqBecauseNOR <- FreqBecauseNOR %>% distinct(ID, n, Gender, Length)
# Compute the rate of 'because' per word of text:
FreqBecauseNOR <- mutate(FreqBecauseNOR, Rate = n / Length)
FreqBecauseNOR
```
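The counting, joining, and de-duplication steps above can also be written as a single pipeline. The following is a minimal sketch on hypothetical toy data, not the study data: the names `toy_hits` and `toy_rates` and all values are invented for illustration, and it assumes one row per occurrence of the word, as in the concordance file.

```{r}
library(dplyr)

# Hypothetical toy concordance: one row per occurrence of 'because'.
toy_hits <- tibble(
  ID     = c("w1", "w1", "w1", "w2", "w3", "w3"),
  Length = c(500, 500, 500, 400, 800, 800),
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male")
)

# Count occurrences per writer while keeping the writer-level columns,
# then compute the rate per word of text.
toy_rates <- toy_hits %>%
  count(ID, Length, Gender, name = "n") %>%
  mutate(Rate = n / Length)

toy_rates
```

Counting by `ID`, `Length`, and `Gender` together returns one row per writer directly, so no join or `distinct()` call is needed, provided that `Length` and `Gender` are constant within each writer.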
```{r}
# The plots show the rate of 'because' (occurrences divided by text length) for each text written by the Norwegian learners:
# Basic bar plot:
FreqBecauseNOR %>%
  ggplot(aes(x = ID, y = Rate)) +
  geom_bar(stat = 'identity')
# One fill colour per writer:
FreqBecauseNOR %>%
  ggplot(aes(x = ID, y = Rate, fill = ID)) +
  geom_bar(stat = 'identity')
library(RColorBrewer)
# Interpolate the 8-colour Set3 palette to one colour per writer:
nb.cols <- n_distinct(FreqBecauseNOR$ID)
mycolors <- colorRampPalette(brewer.pal(8, "Set3"))(nb.cols)
FreqBecauseNOR %>%
  ggplot(aes(x = ID, y = Rate, fill = ID)) +
  geom_bar(stat = 'identity') +
  scale_fill_manual(values = mycolors) +
  theme_classic() +
  xlab(NULL) +
  ylab("Rate of 'because'") +
  scale_y_continuous(expand = c(0, 0)) +
  theme(axis.text.y = element_text(size = 10),
        axis.text.x = element_text(angle = 45, hjust = 1,
                                   size = 6, face = 'bold'),
        axis.title.y = element_text(size = 15,
                                    face = 'bold',
                                    margin = margin(r = 15)),
        legend.position = 'none')
```
```{r}
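# Calculate the log frequency (base 10) of 'because':
FreqBecauseNOR <- mutate(FreqBecauseNOR, LogFreq = log10(n))
FreqBecauseNOR
# Fit the linear model.
# Note: LogFreq and Rate are both computed from n, so they are not independent
# predictors of n; the coefficients should be interpreted with caution.
FreqBecauseNOR_mdl <- lm(n ~ Length + Gender + LogFreq + Rate,
                         data = FreqBecauseNOR)
summary(FreqBecauseNOR_mdl)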
```
```{r}
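# The response and the numeric predictors are standardized (z-scored).
# as.numeric() drops the matrix attributes that scale() returns:
FreqBecauseNOR <- mutate(FreqBecauseNOR,
                         n_z = as.numeric(scale(n)),
                         Length_z = as.numeric(scale(Length)),
                         LogFreq_z = as.numeric(scale(LogFreq)),
                         Rate_z = as.numeric(scale(Rate)))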
```
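As a quick check of what standardizing does, the sketch below compares `scale()` with the manual z-score formula on a few hypothetical counts. The vector `toy_n` is invented for illustration only.

```{r}
# Hypothetical counts, used only for illustration.
toy_n <- c(2, 5, 1, 8, 4)

# scale() subtracts the mean and divides by the sample standard deviation.
z_scale  <- as.numeric(scale(toy_n))
z_manual <- (toy_n - mean(toy_n)) / sd(toy_n)

# The two versions agree.
all.equal(z_scale, z_manual)

# After standardizing, the mean is (numerically) zero and the standard deviation is one.
c(mean = mean(z_scale), sd = sd(z_scale))
```

Because every standardized variable is expressed in standard deviations, the coefficients of a model fitted on them can be compared with one another more easily.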
```{r}
# The model is refitted with the standardized variables:
FreqBecauseNOR_mdl_z <- lm(n_z ~ Length_z + LogFreq_z + Rate_z,
                           data = FreqBecauseNOR)
summary(FreqBecauseNOR_mdl_z)
```
```{r}
# Keep only the rows whose Gender is 'Female' or 'Male':
GenderNOR <- filter(FreqBecauseNOR, Gender %in% c('Female', 'Male'))
# Model Rate as a function of Gender:
GenderNOR_mdl <- lm(Rate ~ Gender, data = FreqBecauseNOR)
summary(GenderNOR_mdl)
```
```{r}
# A two-sample t-test (equal variances assumed) is performed for Rate by Gender:
t.test(Rate ~ Gender, data = FreqBecauseNOR, var.equal = TRUE)
```
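The linear model with a two-level predictor and the equal-variance t-test are the same test. The sketch below illustrates this on simulated data; `toy_gender` and its values are hypothetical and do not come from the study.

```{r}
set.seed(1)
# Simulated rates for two hypothetical groups of writers.
toy_gender <- data.frame(
  Gender = rep(c("Female", "Male"), each = 10),
  Rate   = c(rnorm(10, mean = 0.010, sd = 0.003),
             rnorm(10, mean = 0.011, sd = 0.003))
)

toy_t  <- t.test(Rate ~ Gender, data = toy_gender, var.equal = TRUE)
toy_lm <- summary(lm(Rate ~ Gender, data = toy_gender))

# With equal variances assumed, both give the same p-value for the group difference.
p_t  <- toy_t$p.value
p_lm <- toy_lm$coefficients["GenderMale", "Pr(>|t|)"]
c(t_test = p_t, lm = p_lm)
```

The equivalence holds only with `var.equal = TRUE`; the default Welch test in `t.test()` does not pool the variances and can give a slightly different p-value.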
```{r}
# Gender is converted to a factor:
FreqBecauseNOR <- mutate(FreqBecauseNOR, Gender_fac = factor(Gender))
levels(FreqBecauseNOR$Gender_fac)
contrasts(FreqBecauseNOR$Gender_fac)
```
```{r}
# The factor is sum-coded for both levels:
contrasts(FreqBecauseNOR$Gender_fac) <- contr.sum(2)
contrasts(FreqBecauseNOR$Gender_fac)
```
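To make the effect of sum coding concrete, the sketch below fits the same model under both coding schemes on a small hypothetical data set (`toy_sum`, invented for illustration) in which the group means are 2 and 4.

```{r}
# Hypothetical balanced toy data: group means are 2 (Female) and 4 (Male).
toy_sum <- data.frame(
  Gender_fac = factor(rep(c("Female", "Male"), each = 3)),
  y          = c(1, 2, 3, 3, 4, 5)
)

# Default treatment coding: intercept = Female mean, slope = Male mean - Female mean.
coef_treat <- coef(lm(y ~ Gender_fac, data = toy_sum))

# Sum coding: intercept = mean of the two group means,
# slope = Female mean minus that intercept (half the difference, with the sign flipped).
contrasts(toy_sum$Gender_fac) <- contr.sum(2)
coef_sum <- coef(lm(y ~ Gender_fac, data = toy_sum))

coef_treat
coef_sum
```

Under treatment coding the coefficients are 2 and 2; under sum coding they are 3 and -1. The fitted group means and the p-value for the gender effect are the same in both cases; only the meaning of the coefficients changes.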
The p-value shows that there is no significant difference between male and female writers in the rate of using the word 'because':
```{r}
summary(GenderNOR_mdl <- lm(Rate ~ Gender_fac,
data = FreqBecauseNOR))
```
The plot shows that, although male students on average used 'because' slightly more than female students, the p-value of 0.45 indicates no significant difference between the two groups:
```{r}
# Plot each writer's rate by gender; the black point and bar show each group's mean ± 1 standard error:
GenderNOR %>% ggplot(aes(x = Gender, y = Rate, col = Gender)) +
  geom_point() +
  stat_summary(fun.data = mean_se, colour = 'black')
```
```{r}
# Gender * Gender expands to Gender alone, so this fits the same model as Rate ~ Gender:
summary(lm(Rate ~ Gender * Gender, data = GenderNOR))
```
```{r}
# Rate is centered:
GenderNOR <- mutate(GenderNOR, Rate_c = Rate - mean(Rate, na.rm = TRUE))
# The model is refitted with the centered rate and its interaction with Gender:
GenderNOR_Rate_mdl <- lm(n ~ Rate_c * Gender, data = GenderNOR)
summary(GenderNOR_Rate_mdl)
```
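Centering changes where the intercept is evaluated, not the slope. The sketch below shows this on simulated data; `toy_x` and `toy_y` are hypothetical and unrelated to the study data.

```{r}
set.seed(2)
# Simulated predictor whose values lie far from zero.
toy_x <- runif(30, min = 50, max = 100)
toy_y <- 3 + 0.5 * toy_x + rnorm(30)

toy_raw <- lm(toy_y ~ toy_x)
toy_xc  <- toy_x - mean(toy_x)   # centered predictor
toy_cen <- lm(toy_y ~ toy_xc)

# The slope is unchanged; the intercept becomes the predicted value at the mean of x,
# which in a simple regression equals the mean of y.
rbind(raw = coef(toy_raw), centered = coef(toy_cen))
```

In a model with an interaction, such as `n ~ Rate_c * Gender`, the same logic means that the Gender coefficient describes the gender difference at the average rate rather than at a rate of zero.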
```{r}
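# Plot the count against the rate for each gender. The solid blue line marks Rate = 0
# (where the uncentered intercept is read off); the dashed line marks the mean rate:
GenderNOR %>% ggplot(aes(x = Rate, y = n, col = Gender)) +
  geom_point() +
  facet_wrap(~Gender) +
  geom_smooth(formula = y ~ x, method = 'lm', fullrange = TRUE) +
  geom_vline(xintercept = 0, size = 2, col = 'blue') +
  geom_vline(xintercept = mean(GenderNOR$Rate, na.rm = TRUE), linetype = 2)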
```
In the next step, the length of the text is modeled as a function of the rate of using the word 'because' (occurrences divided by text length). The fixed-effects table below shows that the rate of using 'because' goes down as the length of the text increases (6.7 versus 2.1). Two random intercepts, for ID and for Gender, are added to the mixed model. The variance and standard deviation estimated for Gender are 0.0, which indicates that the model attributes no additional variability to the difference between male and female Norwegian writers; this is in line with the non-significant gender difference found above. The message on the first line of the output points in the same direction. The standard deviation for ID in the random-effects table (0.15) shows the variability among individuals in the rate of using 'because' in their texts.
```{r}
# Poisson mixed model with random intercepts for Gender and for writer (ID):
BecauseNor_mdl <- glmer(Length ~ 1 + Rate +
                          (1 | Gender) + (1 | ID),
                        data = FreqBecauseNOR,
                        family = poisson)
summary(BecauseNor_mdl)
```
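When the quantity of interest is a count relative to text length, one common alternative is a Poisson regression of the count with the log of the length as an offset. The sketch below is not the model fitted above; it is a minimal illustration on simulated data using base R `glm()`, and `toy_pois`, its columns, and the simulated rate of 5 per 1000 words are all assumptions made for the example.

```{r}
set.seed(3)
# Simulated writers: text length in words and a gender label (hypothetical data).
toy_pois <- data.frame(
  Length = round(runif(40, min = 300, max = 900)),
  Gender = rep(c("Female", "Male"), times = 20)
)
# Simulate word counts at a true rate of 5 per 1000 words for both genders.
toy_pois$n <- rpois(40, lambda = 0.005 * toy_pois$Length)

# The offset makes the model describe n / Length (a rate) on the log scale.
toy_pois_mdl <- glm(n ~ Gender + offset(log(Length)),
                    data = toy_pois, family = poisson)

# Exponentiating gives the baseline rate per word and the Male/Female rate ratio.
exp(coef(toy_pois_mdl))
```

Poisson coefficients are on the log scale, so exponentiating them is what turns the intercept into a rate and the Gender coefficient into a ratio of rates.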
Calling the coef function on the model returns, for each level of each random effect, the estimated coefficients on the log scale of the Poisson model, per individual and per gender. The variation in the intercept column shows that the model does not assume the same intercept for every individual: different people are allowed different baselines, so some people use the word 'because' more than others. The slope for Rate, however, is the same for every person, because the model includes random intercepts only; the effect of the rate is therefore assumed to work in the same way for everybody.
```{r}
coef(BecauseNor_mdl)
coef(BecauseNor_mdl)$Gender
coef(BecauseNor_mdl)$ID
```
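The relation between `coef()`, the fixed effects, and the random effects can be checked directly. The sketch below uses the `sleepstudy` data that ships with lme4 as a stand-in, since it has the same random-intercept structure; `toy_mm` is a hypothetical name and the model is not part of this study.

```{r}
library(lme4)

# sleepstudy ships with lme4; it stands in for the writer data here.
toy_mm <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# coef() adds each group's random-intercept deviation to the fixed intercept.
by_hand <- fixef(toy_mm)["(Intercept)"] + ranef(toy_mm)$Subject[, "(Intercept)"]
all.equal(unname(by_hand), coef(toy_mm)$Subject[, "(Intercept)"])

# With random intercepts only, the slope is identical for every group.
unique(coef(toy_mm)$Subject[, "Days"])
```

The intercept column therefore varies across groups while the slope column repeats the single fixed-effect estimate, which is the pattern described above for ID and Rate.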