Question:


Complete the data analysis required by the specification. Write up your analysis using your favourite word-processing/typesetting program, making sure that all of the working is shown and that it is presented well.

Find out what the public associates with the company name.


The company requests four parts of analysis.

What do the test's results tell us about the company tweets and the random sample of tweets?

Calculate the share of company tweets in each group.

What do these results say about the company's Twitter topics and the public opinion of the company?

Identify problems with each part's analytical process, and how these problems may have affected the results.

Answer to Question: 300958 Social Web Analytics

The R code used for the random sample was:

# Load the tm library
library(tm)

# Load the CSV file
randomSample = read.csv("/home/prajnan/Downloads/randomSample1.csv", header = T)

# Create a corpus from the data frame
sam = Corpus(DataframeSource(randomSample))

# Define a transformer that replaces a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Convert "/" and "@" to spaces
sam <- tm_map(sam, toSpace, "/")
sam <- tm_map(sam, toSpace, "@")

# Convert the text to lowercase
sam <- tm_map(sam, content_transformer(tolower))

# Remove numbers
sam <- tm_map(sam, removeNumbers)

# Remove common English stopwords
sam <- tm_map(sam, removeWords, stopwords("english"))

# Remove punctuation marks
sam <- tm_map(sam, removePunctuation)

# Finally, strip extra whitespace
sam <- tm_map(sam, stripWhitespace)

# Construct the term-document matrix for the random sample
dtm = TermDocumentMatrix(sam)

# Hold the matrix in a variable
m = as.matrix(dtm, sparse = TRUE)

# Calculate the row sums of the matrix, sorted by frequency
v <- sort(rowSums(m), decreasing = TRUE)

# Construct a table that ranks words by frequency
d <- data.frame(word = names(v), freq = v)

# The 10 highest-frequency words
head(d, 10)
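The frequency-counting pipeline above (lowercase, replace separators with spaces, drop numbers, punctuation and stopwords, then rank by total count) can be sketched language-neutrally in a few lines of plain Python. The tweets and the stopword list below are made-up illustrations, not the assignment data:

```python
from collections import Counter
import re

def top_words(docs, stopwords, n=10):
    """Lowercase each document, break "/" and "@" into spaces, keep only
    alphabetic tokens (dropping numbers and punctuation), remove stopwords,
    and return the n most frequent words with their counts."""
    counts = Counter()
    for doc in docs:
        text = doc.lower().replace("/", " ").replace("@", " ")
        for tok in re.findall(r"[a-z]+", text):
            if tok not in stopwords:
                counts[tok] += 1
    return counts.most_common(n)

# Hypothetical tweets and a tiny hypothetical stopword list
tweets = [
    "Check this out https://t.co/abc via @friend",
    "New release! Get it now via https://t.co/xyz",
    "Just like old times, new and new again",
]
stop = {"the", "this", "it", "and", "out", "now", "old"}
print(top_words(tweets, stop, 5))
```

Note how URL fragments such as "https", "t" and "co" dominate the ranking, which mirrors why "http" and "tco" surface as top terms in the real sample.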

Because the 6990-row random sample we were given was large, we divided the given file into two parts of 3995 rows and performed separate analyses on each part.

Examining the term-document matrix revealed that it has a large sparsity index, which means that many of its entries are zero.

This is why we set sparse = TRUE when creating the matrix variable from the term-document matrix.
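As a minimal sketch of what the sparsity index measures, the fraction of zero cells in a toy term-document matrix (invented values, rows = terms, columns = documents) can be computed directly:

```python
# Toy term-document matrix with invented counts
tdm = [
    [2, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 3],
]

def sparsity(matrix):
    """Fraction of zero entries; tm reports this as the 'Sparsity' percentage
    when a TermDocumentMatrix is printed."""
    cells = [x for row in matrix for x in row]
    return sum(1 for x in cells if x == 0) / len(cells)

print(round(sparsity(tdm), 2))  # → 0.67
```

A sparsity of 0.67 here means two thirds of the cells are zero; real tweet corpora are typically far sparser, since most words occur in very few tweets.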

The two cases produced the following output (see figure q1(1).png).

By comparing these outputs, it is possible to see that the words "http" and "tco" appear in both.

The remaining top-ten words are "just", "https", "new", "like", "via", "get", "can" and "time". The random sample shows that five of the most-used words are common web/internet-related terms, and the next five are common words used in daily conversation.

The output for the second part is shown in figure q1(2).png.

MercedesBenz was selected for this analysis.

We analyze tweets using the twitteR package.

The R program is:

# Load the required libraries
library(httr)
library(devtools)
library(httk)
library(httpuv)
library(twitteR)

# Authorise via Twitter
setup_twitter_oauth("api_key", "api_secret", access_token = NULL, access_secret = NULL)

# Search Twitter
tw = searchTwitter("MercedesBenz", n = 1000, lang = "en")

# Convert the output to a data frame
df = twListToDF(tw)

# Write the data frame to a CSV file
write.csv(df, file = "AboutCompanyTweets.csv")

A frequency vector was formed from the row sums and combined with the vector from the previous section into a matrix, which was used for the chi-square test.
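One caveat with binding two frequency vectors column-wise is that the result only forms a meaningful contingency table if the vectors are matched word-by-word, not by rank. Below is a hedged Python sketch of that alignment and of the Pearson chi-squared statistic it feeds (invented counts, hand-rolled statistic with no continuity correction, not the assignment's actual data):

```python
def chi_square(table):
    """Pearson chi-squared statistic for an r x c contingency table
    (no continuity correction): sum of (obs - exp)^2 / exp."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / total
            stat += (obs - exp) ** 2 / exp
    return stat

# Align two word-frequency dictionaries on their shared vocabulary
# before binding them into a two-column matrix.
random_freq  = {"http": 50, "new": 30, "car": 5}   # hypothetical counts
company_freq = {"http": 40, "new": 10, "car": 45}  # hypothetical counts
words = sorted(set(random_freq) & set(company_freq))
mat = [[random_freq[w], company_freq[w]] for w in words]
print(round(chi_square(mat), 2))  # → 42.69
```

A large statistic (here 42.69 on 2 degrees of freedom) gives a tiny p-value, i.e. the two word distributions differ markedly, which is the pattern reported for the real data below.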

The R code used was:

library(httr)
library(tm)

# First vector: word frequencies from the first random sample
randomSample = read.csv("/home/prajnan/Downloads/randomSample1.csv", header = T)
sam = Corpus(DataframeSource(randomSample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
sam <- tm_map(sam, toSpace, "/")
sam <- tm_map(sam, toSpace, "@")
sam <- tm_map(sam, content_transformer(tolower))
sam <- tm_map(sam, removeNumbers)
sam <- tm_map(sam, removeWords, stopwords("english"))
sam <- tm_map(sam, removePunctuation)
sam <- tm_map(sam, stripWhitespace)
dtm = TermDocumentMatrix(sam)
m = as.matrix(dtm, sparse = TRUE)

# Form the first vector
v1 <- sort(rowSums(m), decreasing = TRUE)

# Second vector: word frequencies from the second random sample
randomSample = read.csv("/home/prajnan/Downloads/randomSample2016.csv", header = T)
sam = Corpus(DataframeSource(randomSample))
sam <- tm_map(sam, toSpace, "/")
sam <- tm_map(sam, toSpace, "@")
sam <- tm_map(sam, content_transformer(tolower))
sam <- tm_map(sam, removeNumbers)
sam <- tm_map(sam, removeWords, stopwords("english"))
sam <- tm_map(sam, removePunctuation)
sam <- tm_map(sam, stripWhitespace)
dtm = TermDocumentMatrix(sam)
m = as.matrix(dtm, sparse = TRUE)

# Form the second vector
v <- sort(rowSums(m), decreasing = TRUE)

# Third vector: word frequencies from the tweets about the company
a = read.csv("/home/prajnan/AboutCompanyTweets.csv")
sam = Corpus(DataframeSource(a))
sam <- tm_map(sam, toSpace, "/")
sam <- tm_map(sam, toSpace, "@")
sam <- tm_map(sam, content_transformer(tolower))
sam <- tm_map(sam, removeNumbers)
sam <- tm_map(sam, removeWords, stopwords("english"))
sam <- tm_map(sam, removePunctuation)
sam <- tm_map(sam, stripWhitespace)
sam <- tm_map(sam, stemDocument)
dtm = TermDocumentMatrix(sam)
dtm <- removeSparseTerms(dtm, sparse = 0.95)
m2 <- as.matrix(dtm)

# Form the third vector
w = sort(rowSums(m2), decreasing = TRUE)

# Combine the second and third vectors
Mat = cbind(v, w)

# Combine the first and third vectors
Mat1 = cbind(v1, w)

# Perform the chi-square tests separately
chisq.test(Mat, correct = TRUE)
chisq.test(Mat1, correct = TRUE)

The output from the above code can be seen in the figures square test.png and square-2.png. In both cases, the p-value for the chi-squared statistic is very low, less than 2.2 × 10^-16, which at the default significance level leads us to conclude that there is no correlation between the word frequencies of the tweets about the company and those of the random tweets.

To retrieve the latest tweets from the company, the userTimeline function was used in R. These tweets were then combined with the part 2 tweets about the company, and the resulting file was clustered with k-means using k = 2. The R code used was:

library(httr)
library(devtools)
library(httk)
library(httpuv)
library(twitteR)

setup_twitter_oauth("api_key", "api_secret", access_token = NULL, access_secret = NULL)

# Retrieve the company's own tweets
tw1 = userTimeline("MercedesBenz", n = 1000)
df = twListToDF(tw1)
write.csv(df, file = "FromCompanyTweets.csv")

library(tm)

# Combine the tweets from and about the company
a = read.csv("/home/prajnan/AboutCompanyTweets.csv")
b = read.csv("/home/prajnan/CompanyTweets.csv")
d = rbind(a, b)
write.csv(d, file = "d.csv")

randomSample = read.csv("/home/prajnan/d.csv", header = T)
sam = Corpus(DataframeSource(randomSample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
sam <- tm_map(sam, toSpace, "/")
sam <- tm_map(sam, toSpace, "@")
sam <- tm_map(sam, content_transformer(tolower))
sam <- tm_map(sam, removeNumbers)
sam <- tm_map(sam, removeWords, stopwords("english"))
sam <- tm_map(sam, removePunctuation)
sam <- tm_map(sam, stripWhitespace)

sam = tm_map(sam, stemDocument)
dtm = TermDocumentMatrix(sam)
dtm <- removeSparseTerms(dtm, sparse = 0.95)
m2 <- as.matrix(dtm)

# Transpose the matrix to facilitate clustering
m3 <- t(m2)

# Set the seed
set.seed(122)

# Set k
k <- 2

# Perform k-means clustering
kmeansResult <- kmeans(m3, k)

# Get the cluster centres
round(kmeansResult$centers, digits = 3)

Figure fr part3.png illustrates the result of the clustering. The clustering results indicate that "false" and "true" are the main words, with average frequencies of 4 and 0 in one cluster and 3 and 1 in the other.

The above results indicate that MercedesBenz is not very well known among the general population. Furthermore, there are few words that both the company and the public use when talking about MercedesBenz.
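The k-means step can be illustrated with a tiny self-contained sketch: Lloyd's algorithm on toy one-dimensional frequency values chosen to echo the 4/0 versus 3/1 pattern reported above. The data and the resulting centres are illustrative only, not the real cluster output:

```python
import random

def kmeans(points, k, iters=20, seed=122):
    """Plain Lloyd's-algorithm k-means on 1-D values: assign each point to
    its nearest centre, recompute centres as cluster means, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[j].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Hypothetical per-document frequencies of one word across the combined
# company/public tweet file
freqs = [0, 1, 0, 1, 4, 3, 4, 3]
print(kmeans(freqs, 2))  # → [0.5, 3.5]
```

With k = 2 the low-frequency documents and the high-frequency documents separate cleanly, which is the same kind of split the real cluster centres describe.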
