According to Wikipedia, natural language generation (NLG) is the natural language processing task of generating natural language from a machine representation system such as a knowledge base or a logical form. Psycholinguists prefer the term language production when such formal representations are interpreted as models for mental representations.
Approaches to natural language generation range from simple rule-based text generation to highly advanced deep learning models. Here we will explore a simple but effective way of doing NLG with a Markov chain model.
Please note, we will not get into the internals of building a Markov chain; rather, this article focuses on implementing the solution using the Python module markovify.
Markovify is a simple, extensible Markov chain generator. Right now, its main use is for building Markov models of large corpora of text and generating random sentences from that. But, in theory, it could be used for other applications.
pip install markovify
The dataset includes the entire corpus of articles published by the ABC website in the given time range. With a volume of 200 articles per day and a good focus on international news, we can be fairly certain that every event of significance has been captured here. This dataset can be downloaded from Kaggle Datasets.
Markov chains, named after Andrey Markov, are mathematical systems that hop from one “state” (a situation or set of values) to another. For example, if you made a Markov chain model of a baby’s behavior, you might include “playing,” “eating,” “sleeping,” and “crying” as states, which together with other behaviors could form a “state space”: a list of all possible states. In addition, on top of the state space, a Markov chain tells you the probability of hopping, or “transitioning,” from one state to any other state, e.g., the chance that a baby currently playing will fall asleep in the next five minutes without crying first. Read more about how Markov chains work in this interactive article by Victor Powell.
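The baby example above can be sketched in a few lines of Python. The transition probabilities below are made up purely for illustration:

```python
import random

# Hypothetical transition probabilities for the baby example;
# the numbers are invented purely for illustration.
transitions = {
    "playing":  {"playing": 0.5, "eating": 0.2, "sleeping": 0.2, "crying": 0.1},
    "eating":   {"playing": 0.3, "eating": 0.2, "sleeping": 0.4, "crying": 0.1},
    "sleeping": {"playing": 0.4, "eating": 0.3, "sleeping": 0.2, "crying": 0.1},
    "crying":   {"playing": 0.2, "eating": 0.4, "sleeping": 0.3, "crying": 0.1},
}

def walk(state, steps, seed=0):
    """Hop through the state space, picking each next state by its transition probability."""
    rng = random.Random(seed)
    chain = [state]
    for _ in range(steps):
        options = transitions[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        chain.append(state)
    return chain

print(walk("playing", 5))
```

markovify does essentially this, except its states are sequences of words learned from a corpus.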
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import markovify     # Markov chain generator
inp = pd.read_csv('../input/abcnews-date-text.csv')
inp.head(3)

  publish_date                                      headline_text
0     20030219  aba decides against community broadcasting lic...
1     20030219     act fire witnesses must be aware of defamation
2     20030219     a g calls for infrastructure protection summit
text_model = markovify.NewlineText(inp.headline_text, state_size = 2)
# Print ten randomly-generated sentences using the built model
for i in range(10):
    print(text_model.make_sentence())

federal treasurer to deliver grim climate prediction
the myth of mum and baby injured after crashing into tree
aussie duo in sync
xenophon calls for more uranium mines
tough penalties fake aboriginal art exhibition katherine
govt considers afghan troop request
one mans plan to fight bushfire at neath
fifa confirms candidates for 2010 start
no white flag on costly hail storm payouts unlikely before christmas
is it really like
Now, this text could become input for a Twitter bot, a Slack bot, or even a parody blog. And that is the point: generating text that sounds less like a machine and more like it was written by humans.
Imagine you run an online business like Amazon.com and you want to plan server resources for the next year. It is imperative to know when your load is going to spike (or at least when it spiked in the past, to believe it will repeat), and that is where time series anomaly detection comes in. While packages like Twitter's AnomalyDetection have been doing this job, there is a new candidate in town, anomalize, that does something no other anomaly detection package was doing: tidy anomaly detection.
Please note, the purpose of this article is to help you perform anomaly detection in R the tidy way, not to teach the principles and concepts of anomaly detection or time series data.
Sorry to say this! Data scientists who use R are known to write clumsy code, code that is not very readable and not very efficient, but this trend has been changing because of the tidy principle popularized by Hadley Wickham, who needs no introduction in the R universe: his tidyverse contributes to the efficiency and work of a lot of R data scientists. Now this new package, anomalize, open-sourced by Business Science, does time series anomaly detection in line with other tidyverse packages (or packages supporting tidy data), including one of the most-used tidyverse features: compatibility with the pipe %>% operator, for writing readable and reproducible data pipelines.
The stable version of the R package anomalize is available on CRAN and can be installed as below:
install.packages('anomalize')
The latest development version of anomalize is available on GitHub and can be installed as below:
# install.packages('devtools')
devtools::install_github("business-science/anomalize")
Since installing the development version does not require compilation tools, it may be preferable to install it from GitHub to get the latest bug fixes and features.
It is easier to learn a new concept or piece of code by actually doing it and relating it to something we are familiar with. So, to understand tidy anomaly detection in R, we will try to detect anomalies in the Bitcoin price since 2017.
We use the following three packages to solve the above case:
library(anomalize)  # tidy anomaly detection
library(tidyverse)  # tidyverse packages like dplyr, ggplot2, tidyr
library(coindeskr)  # Bitcoin price extraction from CoinDesk
We use get_historic_price() from coindeskr to extract the historic Bitcoin price from CoinDesk. The resulting dataframe is stored in the object btc.
btc <- get_historic_price(start = "2017-01-01")
For anomaly detection using anomalize, we need to have either a tibble or a tibbletime object. Hence we convert the dataframe btc into a tibble object that follows a time series shape and store it in btc_ts.
btc_ts <- btc %>%
  rownames_to_column() %>%
  as.tibble() %>%
  mutate(date = as.Date(rowname)) %>%
  select(-one_of('rowname'))
Just looking at the head of btc_ts to see sample data:
head(btc_ts)

  Price date
1  998. 2017-01-01
2 1018. 2017-01-02
3 1031. 2017-01-03
4 1130. 2017-01-04
5 1006. 2017-01-05
6  896. 2017-01-06
One of the important things to do with time series data before starting time series forecasting or modelling is time series decomposition, where the series is decomposed into seasonal, trend and remainder components. anomalize has a function time_decompose() to perform this. Once the components are decomposed, anomalize can detect and flag anomalies in the remainder component of the decomposed data, which can then be visualized with plot_anomaly_decomposition().
btc_ts %>%
  time_decompose(Price, method = "stl", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
  plot_anomaly_decomposition()
As you can see from the above code, the decomposition is based on the "stl" method, which is the most common method of time series decomposition. If you have been using Twitter's AnomalyDetection, the same approach can be implemented in anomalize by combining time_decompose(method = "twitter") with anomalize(method = "gesd"). The "stl" method of decomposition can also be combined with anomalize(method = "iqr") for a different, IQR-based anomaly detection.
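For intuition, the IQR-style flagging that anomalize applies to the remainder can be sketched language-agnostically. Here is a rough Python sketch using the classic Tukey fences; this mirrors the spirit of the method, not anomalize's exact internals (which use an alpha-controlled scaling):

```python
def iqr_anomalies(remainder):
    # Classic Tukey fences: values beyond 1.5 * IQR from the quartiles
    # are flagged as anomalies. This is a simplified stand-in for
    # anomalize's 'iqr' method, not its actual code.
    xs = sorted(remainder)
    n = len(xs)
    q1, q3 = xs[n // 4], xs[(3 * n) // 4]  # crude quartiles for illustration
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return ["Yes" if (x < lo or x > hi) else "No" for x in remainder]

print(iqr_anomalies([1, 2, 1, 3, 2, 50, 2, 1]))
```

The one value far outside the fences (50) gets flagged; everything else is marked normal.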
Anomaly detection and plotting the detected anomalies are almost similar to what we saw above with time series decomposition; it is just that the decomposed components are recomposed back after anomaly detection with time_recompose() and plotted with plot_anomalies(). The package automatically takes care of a lot of parameter settings like index, frequency and trend, making it easier to run anomaly detection out of the box with less prior expertise in the domain.
btc_ts %>%
  time_decompose(Price) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
It can be inferred from the plot how accurately the anomaly detection picks out the Bitcoin price madness that happened in early 2018.
If you are interested in extracting the actual data points which are anomalies, the following code can be used:
btc_ts %>%
  time_decompose(Price) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  filter(anomaly == 'Yes')

Converting from tbl_df to tbl_time.
Auto-index message: index = date
frequency = 7 days
trend = 90.5 days
# A time tibble: 58 x 10
# Index: date
   date       observed  season trend remainder remainder_l1 remainder_l2 anomaly recomposed_l1
 1 2017-11-12    5857.  36.0    7599.    -1778.       -1551.        1672. Yes            6085.
 2 2017-12-04   11617.  11.2    9690.     1916.       -1551.        1672. Yes            8150.
 3 2017-12-05   11696.  -2.01   9790.     1908.       -1551.        1672. Yes            8237.
 4 2017-12-06   13709.  -9.11   9890.     3828.       -1551.        1672. Yes            8330.
 5 2017-12-07   16858.   0.0509 9990.     6868.       -1551.        1672. Yes            8439.
 6 2017-12-08   16057. -28.1    9971.     6114.       -1551.        1672. Yes            8393.
 7 2017-12-09   14913.  -8.03   9953.     4969.       -1551.        1672. Yes            8394.
 8 2017-12-10   15037.  36.0    9934.     5067.       -1551.        1672. Yes            8420.
 9 2017-12-11   16700.  11.2    9916.     6773.       -1551.        1672. Yes            8376.
10 2017-12-12   17178.  -2.01   9897.     7283.       -1551.        1672. Yes            8345.
# ... with 48 more rows, and 1 more variable: recomposed_l2
Thus, anomalize makes it easier to perform anomaly detection in R with cleaner code that can also be used in any data pipeline built with the tidyverse. The code used here is available on my GitHub. If you would like to know more about time series forecasting in R, check out Professor Rob Hyndman's course on DataCamp.
Research paper topic modeling is an unsupervised machine learning method that helps us discover hidden semantic structures in papers, allowing us to learn topic representations of the papers in a corpus. The model can be applied to any kind of labels on documents, such as tags on posts on a website.
The research paper text data is just a bunch of unlabeled texts and can be found here.
We use the following function to clean our texts and return a list of tokens:
import spacy
spacy.load('en')
from spacy.lang.en import English
parser = English()

def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens
We use NLTK’s Wordnet to find the meanings of words, synonyms, antonyms, and more. In addition, we use WordNetLemmatizer to get the root word.
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

from nltk.stem.wordnet import WordNetLemmatizer

def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)
Filtering out stop words:
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))
Now we can define a function to prepare the text for topic modelling:
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens
Open up our data, read it line by line, and for each line prepare the text for LDA and add it to a list. Below we can see how our text data is converted:
import random
text_data = []
with open('dataset.csv') as f:
    for line in f:
        tokens = prepare_text_for_lda(line)
        if random.random() > .99:
            print(tokens)
            text_data.append(tokens)

['sociocrowd', 'social', 'network', 'base', 'framework', 'crowd', 'simulation']
['detection', 'technique', 'clock', 'recovery', 'application']
['voltage', 'syllabic', 'companding', 'domain', 'filter']
['perceptual', 'base', 'coding', 'decision']
['cognitive', 'mobile', 'virtual', 'network', 'operator', 'investment', 'pricing', 'supply', 'uncertainty']
['clustering', 'query', 'search', 'engine']
['psychological', 'engagement', 'enterprise', 'starting', 'london']
['10-bit', '200-ms', 'digitally', 'calibrate', 'pipelined', 'using', 'switching', 'opamps']
['optimal', 'allocation', 'resource', 'distribute', 'information', 'network']
['modeling', 'synaptic', 'plasticity', 'within', 'network', 'highly', 'accelerate', 'i&f', 'neuron']
['tile', 'interleave', 'multi', 'level', 'discrete', 'wavelet', 'transform']
['security', 'cross', 'layer', 'protocol', 'wireless', 'sensor', 'network']
['objectivity', 'industrial', 'exhibit']
['balance', 'packet', 'discard', 'improve', 'performance', 'network']
['bodyqos', 'adaptive', 'radio', 'agnostic', 'sensor', 'network']
['design', 'reliability', 'methodology']
['context', 'aware', 'image', 'semantic', 'extraction', 'social']
['computation', 'unstable', 'limit', 'cycle', 'large', 'scale', 'power', 'system', 'model']
['photon', 'density', 'estimation', 'using', 'multiple', 'importance', 'sampling']
['approach', 'joint', 'blind', 'space', 'equalization', 'estimation']
['unify', 'quadratic', 'programming', 'approach', 'mix', 'placement']
First, we create a dictionary from the data, then convert it to a bag-of-words corpus, and save both the dictionary and the corpus for future use.
from gensim import corpora
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]

import pickle
pickle.dump(corpus, open('corpus.pkl', 'wb'))
dictionary.save('dictionary.gensim')
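Under the hood, corpora.Dictionary assigns every token an integer id, and doc2bow counts ids per document. A minimal pure-Python sketch of that mapping, using two made-up token lists:

```python
from collections import Counter

# Two made-up documents, standing in for our prepared token lists.
docs = [['social', 'network', 'crowd'], ['network', 'clock', 'network']]

# Like corpora.Dictionary: map each distinct token to an integer id.
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def doc2bow(doc):
    # Like Dictionary.doc2bow: (token id, count) pairs for one document.
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

print([doc2bow(d) for d in docs])
# [[(0, 1), (1, 1), (2, 1)], [(1, 2), (3, 1)]]
```

The second document yields (1, 2) because "network" (id 1) appears twice; this sparse id/count representation is exactly what LDA consumes.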
We are asking LDA to find 5 topics in the data:
import gensim
NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model5.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.034*"processor" + 0.019*"database" + 0.019*"issue" + 0.019*"overview"')
(1, '0.051*"computer" + 0.028*"design" + 0.028*"graphics" + 0.028*"gallery"')
(2, '0.050*"management" + 0.027*"object" + 0.027*"circuit" + 0.027*"efficient"')
(3, '0.019*"cognitive" + 0.019*"radio" + 0.019*"network" + 0.019*"distribute"')
(4, '0.029*"circuit" + 0.029*"system" + 0.029*"rigorous" + 0.029*"integration"')
Topic 0 includes words like "processor", "database", "issue" and "overview", which sounds like a topic related to databases. Topic 1 includes words like "computer", "design", "graphics" and "gallery", definitely a graphic-design-related topic. Topic 2 includes words like "management", "object", "circuit" and "efficient", which sounds like a corporate-management-related topic. And so on.
With LDA, we can see that different documents get different topics, and the discrimination between them is obvious.
Let’s try a new document:
new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))

[(38, 1), (117, 1)]
[(0, 0.06669136), (1, 0.40170625), (2, 0.06670282), (3, 0.39819494), (4, 0.066704586)]
My new document is about machine learning algorithms, and the LDA output shows that topic 1 has the highest probability assigned, with topic 3 the second highest. We agree!
Remember that the above 5 probabilities add up to 1.
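We can quickly verify this from the output above:

```python
# The five topic probabilities returned by get_document_topics form a
# distribution over topics, so they should sum to (approximately) 1.
probs = [0.06669136, 0.40170625, 0.06670282, 0.39819494, 0.066704586]
print(abs(sum(probs) - 1.0) < 1e-6)  # True
```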
Now we are asking LDA to find 3 topics in the data:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 3, id2word=dictionary, passes=15)
ldamodel.save('model3.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.029*"processor" + 0.016*"management" + 0.016*"aid" + 0.016*"algorithm"')
(1, '0.026*"radio" + 0.026*"network" + 0.026*"cognitive" + 0.026*"efficient"')
(2, '0.029*"circuit" + 0.029*"distribute" + 0.016*"database" + 0.016*"management"')
We can also find 10 topics:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 10, id2word=dictionary, passes=15)
ldamodel.save('model10.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.055*"database" + 0.055*"system" + 0.029*"technical" + 0.029*"recursive"')
(1, '0.038*"distribute" + 0.038*"graphics" + 0.038*"regenerate" + 0.038*"exact"')
(2, '0.055*"management" + 0.029*"multiversion" + 0.029*"reference" + 0.029*"document"')
(3, '0.046*"circuit" + 0.046*"object" + 0.046*"generation" + 0.046*"transformation"')
(4, '0.008*"programming" + 0.008*"circuit" + 0.008*"network" + 0.008*"surface"')
(5, '0.061*"radio" + 0.061*"cognitive" + 0.061*"network" + 0.061*"connectivity"')
(6, '0.085*"programming" + 0.008*"circuit" + 0.008*"subdivision" + 0.008*"management"')
(7, '0.041*"circuit" + 0.041*"design" + 0.041*"processor" + 0.041*"instruction"')
(8, '0.055*"computer" + 0.029*"efficient" + 0.029*"channel" + 0.029*"cooperation"')
(9, '0.061*"stimulation" + 0.061*"sensor" + 0.061*"retinal" + 0.061*"pixel"')
pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.
Visualizing 5 topics:
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
corpus = pickle.load(open('corpus.pkl', 'rb'))
lda = gensim.models.ldamodel.LdaModel.load('model5.gensim')

import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
Saliency: a measure of how much the term tells you about the topic.
Relevance: a weighted average of the probability of the word given the topic and the probability of the word given the topic normalized by the overall probability of the word.
The size of the bubble measures the importance of the topics, relative to the data.
First, we get the most salient terms, meaning the terms that mostly tell us about what is going on relative to the topics. We can also look at individual topics.
Visualizing 3 topics:
lda3 = gensim.models.ldamodel.LdaModel.load('model3.gensim')
lda_display3 = pyLDAvis.gensim.prepare(lda3, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display3)
Visualizing 10 topics:
lda10 = gensim.models.ldamodel.LdaModel.load('model10.gensim')
lda_display10 = pyLDAvis.gensim.prepare(lda10, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display10)
When we have 5 or 10 topics, we can see that certain topics are clustered together; this indicates similarity between topics. What a nice way to visualize what we have done thus far!
Try it out, find a text dataset, remove the label if it is labeled, and build a topic model yourself!
Source code can be found on Github. I look forward to hearing any feedback or questions.
The lecture will demonstrate ARIMA, which is a purely univariate method of forecasting. The main philosophy here is: “Let the data speak for itself.”
The lecture will cover both the background theorems and their execution in R. In this post we will mainly discuss the theoretical foundations; in the next few posts we will discuss the practical aspects of ARIMA.
ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. So, it is necessary to know the underlying properties of AutoRegressive(AR), Moving Average (MA) and order of integration.
We start with \(Y_t\) which is non-stationary in nature (for example, GDP of India, Stock market indices etc.).
Let \(Y_t\) be modelled as:
$$ Y_t =\delta+\alpha_1Y_{t-1}+u_t~~~~~~~~\dots(1)\\where~~\delta=\text {any constant}\\u_t=\text {white noise} $$
The value of \(Y\) at time \(t\) depends on its value in the previous time period and a random term. In other words, this model says that the forecast value of \(Y\) at time \(t\) is simply some proportion \((=\alpha_1)\) of its value at time \((t-1)\) plus a random shock or disturbance at time \(t\).
For the stationarity of the series, it is required that \(|\alpha_1|<1\).
Again if we write the following model:
$$ Y_t =\delta+\alpha_1Y_{t-1}+\alpha_2Y_{t-2}+u_t~~~~~~~~\dots(2) ~~~~~~~~\implies AR(2) ~~Process~$$
That is, the value of \(Y\) at time \(t\) depends on its value in the previous two time periods.
Following this fashion, we can write:
$$ Y_t=\delta+\alpha_1Y_{t-1}+\alpha_2Y_{t-2}+\dots+\alpha_pY_{t-p}+u_t~~~~~~~~\dots(3) ~~~~~~~~\implies AR(p) ~~Process~$$
The mean is given as:
$$\begin {aligned} E(Y_t)&=E( \delta+\alpha_1 Y_{t-1}+u_t ) \\ E( Y_t )&= E( \delta) +E( \alpha_1 Y_{t-1} )+E( u_t )\\E( Y_t )&=\delta+\alpha_1 E( Y_{t-1} )+0 \end {aligned} $$
Assuming that the series is stationary, \(E(Y_t)=E(Y_{t-1})=\mu ~~\text {(common mean)},\)
$$ \begin {aligned} \mu&=\delta+\alpha_1 \mu\\\implies \mu&=\frac {\delta}{1-\alpha_1} \end {aligned} $$
The variance is calculated as follows:
By independence of errors term and values of \(Y_t\):
$$\begin {aligned} Var(Y_t) &= Var (\delta)+Var(\alpha_1 Y_{t-1})+Var(u_t)\\ Var(Y_t)&=\alpha_1^2 Var(Y_{t-1})+\sigma^2_u \end {aligned} $$
By the stationarity assumption, \(Var(Y_t)=Var(Y_{t-1})\); substituting this, you will get:
$$ Var(Y_t)=\frac {\sigma^2_u}{1-\alpha_1^2} $$
$$ (1-\alpha_1^2)>0 ~~~~~~~\text{since} ~~Var (Y_t)>0\\\implies |\alpha_1|<1 $$
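These two moments can be checked numerically with a quick simulation. The parameters \(\delta=2\), \(\alpha_1=0.5\), \(\sigma_u=1\) are illustrative; theory then gives mean \(\mu = 2/(1-0.5) = 4\) and variance \(1/(1-0.25) = 4/3\):

```python
import random

# Simulate an AR(1) process y_t = delta + alpha1 * y_(t-1) + u_t and
# compare the sample mean and variance with the theoretical values
# mu = delta / (1 - alpha1) and sigma_u^2 / (1 - alpha1^2).
rng = random.Random(42)
delta, alpha1, sigma = 2.0, 0.5, 1.0

y, ys = delta / (1 - alpha1), []  # start at the theoretical mean
for _ in range(200_000):
    y = delta + alpha1 * y + rng.gauss(0, sigma)
    ys.append(y)

mean = sum(ys) / len(ys)
var = sum((v - mean) ** 2 for v in ys) / len(ys)
print(round(mean, 2), round(var, 2))  # theory: mean = 4, variance = 4/3 ≈ 1.33
```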
The ACF at lag \(k\) is defined as:
$$ \rho_k=\frac {\text {Covariance at lag k}}{variance}=\frac {\gamma_k}{\gamma_0} $$
Since both covariance and variance are measured in the same units of measurement, \(\rho_k\) is a unit-free, or pure, number. It lies between −1 and +1, as any correlation coefficient does.
If we plot \(\rho_k\) against \(k\), the graph we obtain is known as the correlogram. It helps us to identify the stationarity of the time series data.
For the PACF, it is the partial autocorrelations that are plotted against the number of lags taken.
Let us write \(Y_t\) as follows:
$$ Y_t= \mu+\beta_0u_t+\beta_1 u_{t-1} ~~~~~~~~\dots (4) ~~~~~~~~\implies MA(1)~~ Process \\where~~~\mu= constant~~~~and ~~u_t=\text {white noise term} $$
That is, \(Y_t\) at time \(t\) is equal to a constant \((\mu)\) plus a moving average \((\beta_0u_t+\beta_1 u_{t-1})\) of the current and past error terms.
So, the \(MA(2)\) process can be written as:
$$ Y_t= \mu+\beta_0u_t+\beta_1 u_{t-1}+\beta_2 u_{t-2} ~~~~~~~~\dots (5) ~~~~~~~~\implies MA(2)~~ Process $$
The \(MA(q)\) process can be written as:
$$ Y_t= \mu+\beta_0u_t+\beta_1 u_{t-1}+\dots+\beta_q u_{t-q} ~~~~~~~~\dots (6) ~~~~~~~~\implies MA(q)~~ Process $$
$$\begin {aligned} Mean &=E(y_t)=\mu\\Variance&=var(y_t)=\sigma_{u}^2 (1+\beta_1^2)\\&\text {The Autocorrelation function (ACF):}\\&\rho_1=\frac {\beta_1}{1+\beta_1^2}~~~\&~~\rho_q=0~~ \text {for all}~~q\ge2 \end {aligned} $$
Note that the only nonzero value in the theoretical ACF is for lag 1. All other autocorrelations are 0. Thus a sample ACF with a significant autocorrelation only at lag 1 is an indicator of a possible \(MA(1)\) model.
$$\begin {aligned} Mean &=E(y_t)=\mu\\Variance&=var(y_t)=\sigma_{u}^2 (1+\beta_1^2+\beta_2^2)\\&\text {The Autocorrelation function (ACF) is:}\\\rho_1&=\frac {\beta_1+\beta_1\beta_2}{1+\beta_1^2+\beta_2^2}~~~\&~~\rho_2=\frac {\beta_2}{1+\beta_1^2+\beta_2^2}~~~\&~~\rho_k=0~~ \text {for all}~~k\ge3 \end {aligned} $$
Note that the only nonzero values in the theoretical ACF are for lags 1 and 2. Autocorrelations for higher lags are 0. So, a sample ACF with significant autocorrelations at lags 1 and 2, but non-significant autocorrelations for higher lags indicates a possible \(MA(2)\) model.
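These ACF results are easy to verify numerically. The sketch below simulates an \(MA(1)\) process with \(\beta_0=1\) and an illustrative \(\beta_1=0.5\), for which theory gives \(\rho_1 = 0.5/(1+0.25) = 0.4\) and \(\rho_2 = 0\):

```python
import random

# Simulate y_t = u_t + 0.5 * u_(t-1) and compute the sample ACF at
# lags 1 and 2; theory predicts roughly 0.4 and 0.0.
rng = random.Random(7)
n, beta1 = 200_000, 0.5
u = [rng.gauss(0, 1) for _ in range(n + 1)]          # white noise
y = [u[t] + beta1 * u[t - 1] for t in range(1, n + 1)]

def acf(series, k):
    """Sample autocorrelation of the series at lag k."""
    m = sum(series) / len(series)
    num = sum((series[t] - m) * (series[t - k] - m) for t in range(k, len(series)))
    den = sum((v - m) ** 2 for v in series)
    return num / den

print(round(acf(y, 1), 2), round(acf(y, 2), 2))  # close to 0.4 and 0.0
```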
It is likely that \(Y\) has characteristics of both AR and MA; it is then called ARMA. Thus, \(Y_t\) follows an \(ARMA(1, 1)\) process if it can be written as:
$$ Y_t = \theta + \alpha_1 Y_{t-1}+\beta_0 u_t+\beta_1 u_{t-1}~~~~~~~~~~\dots (7)~~~~~~\implies ARMA(1,1) $$
So, the \(ARMA(p,q)\) can be written as:
$$ Y_t = \theta + \alpha_1 Y_{t-1}+\dots+\alpha_p Y_{t-p}+\beta_0 u_t+\beta_1 u_{t-1}+\dots+\beta_q u_{t-q}~~~~~~~~~~\dots (8)~~~~~~\implies ARMA(p,q) $$
The earlier models of time series are based on the assumption that the time series variable is stationary (at least in the weak sense). But in practice, most time series variables are non-stationary in nature; they are integrated series.
This implies that you need to take either the first or second difference of the non-stationary time series to convert them into stationary.
As such they may be \(I(1)\) or \(I(2)\) and so on.
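A toy illustration of the idea: a quadratic trend becomes constant only after differencing twice, i.e. it is \(I(2)\):

```python
def difference(series, d=1):
    # Apply first differencing d times: y'_t = y_t - y_(t-1).
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

trend = [t * t for t in range(6)]  # quadratic trend: clearly non-stationary
print(difference(trend, 1))        # [1, 3, 5, 7, 9]  still trending
print(difference(trend, 2))        # [2, 2, 2, 2]     constant: the series is I(2)
```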
Therefore, if you have to difference a time series \(d\) times to make it stationary and then apply the \(ARMA( p, q)\) model to it, it is said that
– the original time series is \(ARIMA( p, d, q)\), that is, it is an ARIMA series
where \(p\) denotes the number of autoregressive terms,
\(d\) the number of times the series has to be differenced before it becomes stationary, and
\(q\) the number of moving average terms.
First of all, you have to identify whether the particular series is stationary and, if not, the order of integration needed to convert it to stationary. That is, you have to identify the ARIMA series.
The Box-Jenkins (BJ) method can answer this question.
The steps in the BJ method are as follows:
As a starting point, it is always advisable to examine the data visually before going into detailed mathematical modeling. So, the examination of the data involves the following things:
The tsclean() function in R is a convenient method for outlier removal and for replacing missing values.
Though ARIMA can be fitted to both seasonal and non-seasonal data, seasonal ARIMA requires a more complicated specification of the model structure, although the process of determining \((p, d, q)\) is similar to that of choosing non-seasonal order parameters.
Therefore, sometimes decomposing the data will give an additional benefit. So, we seek to answer the following questions in this step:
Use the decompose() or stl() function in R to examine and possibly remove components of the series.
The identification step involves the following:
Fit the model using the arima() function, or automatically generate a set of optimal \((p,d,q)\) using auto.arima(). This function searches through combinations of order parameters and picks the set that optimizes model-fit criteria. There exist a number of such criteria for comparing the quality of fit across multiple models; two of the most widely used are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). These criteria are closely related and can be interpreted as an estimate of how much information would be lost if a given model is chosen. When comparing models, one wants to minimize AIC and BIC.
While auto.arima() can be very useful, it is still important to complete the previous steps in order to understand the series and interpret the model results. Note that auto.arima() also allows the user to specify a maximum order for \((p, d, q)\), which is set to 5 by default.
The forestplot package is commonly used to draw forest plots in R. However, I find ggplot2 to have more advantages for making forest plots, such as enabling the inclusion of several variables with many categories in a lattice form. You can also use any scale of your choice, such as a log scale. In this post, I will introduce how to plot risk ratios and their confidence intervals for several conditions.
Let's start by loading the package ggplot2 in R.
library(ggplot2)
For demonstration purposes, I will load data which contains a few columns named Condition, RiskRatio, LowerLimit, UpperLimit, and Group. The current data is in long format; if your data is not in this format, check out the melt function in the reshape package, which provides a really easy way to reshape data into long format. The reference group has RR = 1. My data is in xlsx format, so I load it using read_excel from the readxl package, as demonstrated below.
RR_data <- data.frame(read_excel("C:/Users/fatakora/Dropbox/MY write R write ups/Risk_Ratio_data.xlsx"))

    Condition RiskRatio LowerLimit UpperLimit  Group
1  Condition1    1.0512     1.0174     1.0863 GroupB
2  Condition2    1.0169     0.9638     1.0731 GroupB
3  Condition3    1.0391     1.0185     1.0601 GroupB
10 Condition1    1.1057     1.0667     1.1463 GroupC
11 Condition2    1.4204     1.3471     1.4978 GroupC
12 Condition3    1.0344     1.0105     1.0589 GroupC
19 Condition1    1.0000     1.0000     1.0000 GroupA
20 Condition2    1.0000     1.0000     1.0000 GroupA
21 Condition3    1.0000     1.0000     1.0000 GroupA
For the sake of easy demonstration and simplicity, we truncate the upper limits to a maximum of 2 and the lower limits to a minimum of 0.5.
RR_data$UpperLimit[RR_data$UpperLimit > 2] = 2
RR_data$LowerLimit[RR_data$LowerLimit < 0.5] = 0.5
The following code will plot the graph below:
p = ggplot(data=RR_data, aes(x = Group, y = RiskRatio, ymin = LowerLimit, ymax = UpperLimit)) +
  geom_pointrange(aes(col=Group)) +
  geom_hline(aes(fill=Group), yintercept = 1, linetype=2) +
  xlab('Group') +
  ylab("Risk Ratio (95% Confidence Interval)") +
  geom_errorbar(aes(ymin=LowerLimit, ymax=UpperLimit, col=Group), width=0.5, cex=1) +
  facet_wrap(~Condition, strip.position="left", nrow=9, scales = "free_y") +
  theme(plot.title=element_text(size=16, face="bold"),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.x=element_text(face="bold"),
        axis.title=element_text(size=12, face="bold"),
        strip.text.y = element_text(hjust=0, vjust=1, angle=180, face="bold")) +
  coord_flip()
p
To use a log scale, add scale_y_log10(). For example, on a log scale "half risk" (RR = 0.5) is equidistant from 1 as "double risk" (RR = 2.0). Note that position can be used to change where you want the axis to appear (in this case I chose top; the default is bottom).
p + scale_y_log10(breaks=c(0.5,1,2),position="top",limits=c(0.5,2)) + guides(col = guide_legend(reverse = TRUE))
To make the lines vertical, just take coord_flip() out of p. In this case strip.text.y is not needed, since we do not have to rotate or adjust the labels of the condition panels. The strip.position in facet_wrap is also changed to "top", and the y-axis ticks and text are no longer set to blank, as shown in the following code:
p = ggplot(data=RR_data, aes(x = Group, y = RiskRatio, ymin = LowerLimit, ymax = UpperLimit)) +
  geom_pointrange(aes(col=Group)) +
  geom_hline(aes(fill=Group), yintercept = 1, linetype=2) +
  xlab('Group') +
  ylab("Risk Ratio (95% Confidence Interval)") +
  geom_errorbar(aes(ymin=LowerLimit, ymax=UpperLimit, col=Group), width=0.2, cex=1) +
  facet_wrap(~Condition, strip.position="top", nrow=1, scales = "free_x") +
  theme(plot.title=element_text(size=16, face="bold"),
        axis.text.x=element_text(face="bold"),
        axis.title=element_text(size=12, face="bold")) +
  scale_y_log10(breaks=c(0.5,1,2))
p
I have explored how to make lattice-like forest plots in R using ggplot2. This can be extended to different estimates/measures and their confidence intervals. Note that you can tweak the graphs by playing with the arguments of the functions.
Last year Brexit was a very intense topic for people around the world, not just in the UK or EU. Social media was used by people and politicians to prove their points, and as a result there were comments, tweets, and posts both in support of and against Brexit. Following this example, here we are going to get familiar with Tweepy (a library for Twitter analysis) and AYLIEN (a library with automatic sentiment classification) by classifying the sentiment of Brexit tweets as positive, negative, or neutral.
To start with sentiment analysis using Twitter, we need to get the details required for Tweepy and AYLIEN to work. Below are the steps which should be followed.
Once you are done with the Twitter setup, we can move on to AYLIEN.
So, we have got all the details which will be used for Tweepy and Aylien.
We want to analyze Brexit data from Twitter, but we do not have any data to analyze yet. So here we are going to create a corpus, a collection of tweets that include Brexit talk.
The piece of code below uses the api.search() method to search the Twitter database for any given query term; here we pass #brexit. Setting the count to 100 decreases the chances of search-query failure.
# Search query; store your results in a variable (JSON format)
search_results = api.search(q = "#brexit", lang = "en", result_type = "recent", count = 100)
The results retrieved in the step above will be used to create a text file. We use codecs to avoid formatting issues when the program reads the JSON search_results and writes UTF-8 text.
file = codecs.open("brexit.txt", "w", "utf-8")
for result in search_results:
    file.write(result.text)
    file.write("\n")
file.close()
The brexit.txt file is our corpus, and we will use it to analyze sentiments.
Now that we have a corpus, we need to determine which tweets are positive, negative, or neutral. There are many ways to categorize them; e.g., a few people prefer writing their own code by creating two dictionaries consisting of good and bad words. We are not going to write lots of code, but will instead use AYLIEN, which has built-in functions to determine tweet sentiment.
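For comparison, the dictionary-based approach mentioned above can be sketched in a few lines; the word lists here are tiny and purely illustrative:

```python
# A minimal lexicon-based alternative to an API: count hits from tiny
# hand-made positive/negative word lists (illustrative words only).
positive_words = {"good", "great", "hope", "support", "win"}
negative_words = {"bad", "fear", "crisis", "lose", "against"}

def simple_sentiment(tweet):
    words = tweet.lower().split()
    score = sum(w in positive_words for w in words) - sum(w in negative_words for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(simple_sentiment("great hope for a good deal"))  # positive
print(simple_sentiment("crisis and fear ahead"))       # negative
```

A real lexicon would need thousands of words plus handling of negation and sarcasm, which is exactly why a trained classifier like AYLIEN's is more convenient.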
The code below reads the text file line by line and applies the sentiment function. Once the data has been passed through the sentiment function, we create a CSV file with two columns: one for the tweet text and another for the sentiment type (polarity).
# Initialize a new client of AYLIEN Text API
client = textapi.Client("", "")

with io.open('brexit_tweet_SA.csv', 'w', encoding='utf8', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(["Tweet", "Sentiment"])
    with io.open("brexit.txt", 'r', encoding='utf8') as f:
        for brex_tweet in f.readlines():
            ## Remove extra spaces or newlines around the text
            tweet = brex_tweet.strip()
            ## Reject tweets which are empty so you don't waste your API credits
            if len(tweet) == 0:
                # print('skipped')
                continue
            ## Make call to AYLIEN Text API
            sentiment = client.Sentiment({'text': tweet})
            ## Write the sentiment result into the csv file
            csv_writer.writerow([sentiment['text'], sentiment['polarity']])
Now that we have a file with details of the tweets and their sentiments, we should try to visualize the data.
Visualizing your results is a core part of the analysis; showing sentiment analysis in a simple pie chart gives a clear picture of the sentiments (thoughts) of people without even looking at the data. We will be using pandas and Matplotlib to create the chart.
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

# Open up your csv file with the sentiment results
with open('brexit_tweet_SA.csv', 'r', encoding = 'utf8') as brexcsvfile:
    # Use pandas to read the "Sentiment" column
    df_brex = pd.read_csv(brexcsvfile)
    sent_tweet = df_brex["Sentiment"]

# Use Counter to count how many times each sentiment appears and save each as a variable
counter_var = Counter(sent_tweet)
positive = counter_var['positive']
negative = counter_var['negative']
neutral = counter_var['neutral']

# Declare the variables for the pie chart, using the Counter variables for sizes
labels = 'Positive', 'Negative', 'Neutral'
sizes = [positive, negative, neutral]
colors = ['blue', 'yellow', 'black']
text = "brexit"

# Use matplotlib to plot the chart
plt.pie(sizes, labels = labels, colors = colors, shadow = True, startangle = 90)
plt.title("Sentiment of 100 Tweets about " + text)
plt.show()
The above code will return pie chart as below:
You can see from the chart above that a lot of people do not have any strong opinion on Brexit, and a smaller number of people said negative things about Brexit.
Sentiment analysis became much easier using Tweepy and AYLIEN; imagine doing this analysis manually. While doing this exercise I noticed a few things, e.g. tweets that were not actually against Brexit but sounded negative were categorized as negative, so we will need to do some preprocessing of the data. This exercise was meant to get people familiar with Tweepy and AYLIEN; the code can be found on my GitHub. To complete this exercise, I took reference from the AYLIEN blog and the Twitter developer site.