
To do so, I selected and extracted features from the raw data, including age, days between onset and outcome, gender, and whether the patients were hospitalised. Missing values were imputed and different model algorithms were used to predict the outcome (death or recovery), comparing their prediction accuracy, sensitivity and specificity. The prepared dataset was divided into training and test subsets; the test subset contained all cases with an unknown outcome. Before I applied the models to the test data, I further split the training data into validation subsets.

The tested modeling algorithms were similarly successful at predicting the outcomes of the validation data. To decide on final classifications, I compared predictions from all models and defined the outcome “Death” or “Recovery” as a function of all models, whereas classifications with a low prediction probability were flagged as “uncertain”. Accounting for this uncertainty led to a 100% correct classification of the validation test set.

The cases with unknown outcome were then classified using the same algorithms. Of the 57 unknown cases, 14 were classified as “Recovery”, 10 as “Death” and 33 as uncertain.

The dataset contains case ID, date of onset, date of hospitalisation, date of outcome, gender, age, province and of course the outcome: Death or Recovery. I can already see that there are a couple of missing values in the data, which I will deal with later.

```r
# install and load package
if (!require("outbreaks")) install.packages("outbreaks")
library(outbreaks)

# back up original dataset in case something goes awry along the way
fluH7N9.china.2013_backup <- fluH7N9.china.2013

# convert ? to NAs
fluH7N9.china.2013$age[which(fluH7N9.china.2013$age == "?")] <- NA

# create a new column with case ID
fluH7N9.china.2013$case.ID <- paste("case", fluH7N9.china.2013$case.ID, sep = "_")
head(fluH7N9.china.2013)
```

```
##   case.ID date.of.onset date.of.hospitalisation date.of.outcome outcome gender age province
## 1  case_1    2013-02-19                    <NA>      2013-03-04   Death      m  87 Shanghai
## 2  case_2    2013-02-27              2013-03-03      2013-03-10   Death      m  27 Shanghai
## 3  case_3    2013-03-09              2013-03-19      2013-04-09   Death      f  35    Anhui
## 4  case_4    2013-03-19              2013-03-27            <NA>    <NA>      f  45  Jiangsu
## 5  case_5    2013-03-19              2013-03-30      2013-05-15 Recover      f  48  Jiangsu
## 6  case_6    2013-03-21              2013-03-28      2013-04-26   Death      f  32  Jiangsu
```

Before I start preparing the data for Machine Learning, I want to get an idea of the distribution of the data points and their different variables by plotting them. Most provinces have only a handful of cases, so I am combining them into the category “Other” and keeping only Jiangsu, Shanghai and Zhejiang as separate provinces.

```r
# gather for plotting with ggplot2
library(tidyr)
fluH7N9.china.2013_gather <- fluH7N9.china.2013 %>%
  gather(Group, Date, date.of.onset:date.of.outcome)

# rearrange group order
fluH7N9.china.2013_gather$Group <- factor(fluH7N9.china.2013_gather$Group,
    levels = c("date.of.onset", "date.of.hospitalisation", "date.of.outcome"))

# rename groups
library(plyr)
fluH7N9.china.2013_gather$Group <- mapvalues(fluH7N9.china.2013_gather$Group,
    from = c("date.of.onset", "date.of.hospitalisation", "date.of.outcome"),
    to = c("Date of onset", "Date of hospitalisation", "Date of outcome"))

# renaming provinces
fluH7N9.china.2013_gather$province <- mapvalues(fluH7N9.china.2013_gather$province,
    from = c("Anhui", "Beijing", "Fujian", "Guangdong", "Hebei", "Henan", "Hunan",
             "Jiangxi", "Shandong", "Taiwan"),
    to = rep("Other", 10))

# add a level for unknown gender
levels(fluH7N9.china.2013_gather$gender) <- c(levels(fluH7N9.china.2013_gather$gender), "unknown")
fluH7N9.china.2013_gather$gender[is.na(fluH7N9.china.2013_gather$gender)] <- "unknown"

# rearrange province order so that Other is the last
fluH7N9.china.2013_gather$province <- factor(fluH7N9.china.2013_gather$province,
    levels = c("Jiangsu", "Shanghai", "Zhejiang", "Other"))

# convert age to numeric
fluH7N9.china.2013_gather$age <- as.numeric(as.character(fluH7N9.china.2013_gather$age))

library(ggplot2)
my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 12),
      axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 14),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "aliceblue"),
      strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
      strip.text = element_text(face = "bold", size = 12, color = "black"),
      legend.position = "bottom",
      legend.justification = "top",
      legend.box = "horizontal",
      legend.box.background = element_rect(colour = "grey50"),
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
    )
}

# plotting raw data
ggplot(data = fluH7N9.china.2013_gather, aes(x = Date, y = age, fill = outcome)) +
  stat_density2d(aes(alpha = ..level..), geom = "polygon") +
  geom_jitter(aes(color = outcome, shape = gender), size = 1.5) +
  geom_rug(aes(color = outcome)) +
  labs(
    fill = "Outcome", color = "Outcome", alpha = "Level", shape = "Gender",
    x = "Date in 2013", y = "Age",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Dataset from 'outbreaks' package (Kucharski et al. 2014)",
    caption = ""
  ) +
  facet_grid(Group ~ province) +
  my_theme() +
  scale_shape_manual(values = c(15, 16, 17)) +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  scale_fill_brewer(palette = "Set1")
```

This plot shows the dates of onset, hospitalisation and outcome (if known) of each data point. Outcome is marked by color and age is shown on the y-axis. Gender is marked by point shape. The density distribution of date by age seems to indicate that older people died more frequently in the Jiangsu and Zhejiang provinces than in Shanghai and in other provinces. The distribution of points along the time axis suggests that there might be a positive correlation between the likelihood of death and an early onset or early outcome.

I also want to know how many cases there are for each gender and province and compare the genders’ age distribution.

```r
fluH7N9.china.2013_gather_2 <- fluH7N9.china.2013_gather[, -4] %>%
  gather(group_2, value, gender:province)

fluH7N9.china.2013_gather_2$value <- mapvalues(fluH7N9.china.2013_gather_2$value,
    from = c("m", "f", "unknown", "Other"),
    to = c("Male", "Female", "Unknown gender", "Other province"))

fluH7N9.china.2013_gather_2$value <- factor(fluH7N9.china.2013_gather_2$value,
    levels = c("Female", "Male", "Unknown gender", "Jiangsu", "Shanghai", "Zhejiang", "Other province"))

p1 <- ggplot(data = fluH7N9.china.2013_gather_2, aes(x = value, fill = outcome, color = outcome)) +
  geom_bar(position = "dodge", alpha = 0.7, size = 1) +
  my_theme() +
  scale_fill_brewer(palette = "Set1", na.value = "grey50") +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  labs(
    color = "", fill = "", x = "", y = "Count",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Gender and Province numbers of flu cases",
    caption = ""
  )

p2 <- ggplot(data = fluH7N9.china.2013_gather, aes(x = age, fill = outcome, color = outcome)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  scale_fill_brewer(palette = "Set1", na.value = "grey50") +
  my_theme() +
  labs(
    color = "", fill = "", x = "Age", y = "Density",
    title = "", subtitle = "Age distribution of flu cases", caption = ""
  )

library(gridExtra)
library(grid)
grid.arrange(p1, p2, ncol = 2)
```

In the dataset, there are more male than female cases and correspondingly, we see more deaths, recoveries and unknown outcomes in men than in women. This is potentially a problem later on for modeling because the inherent likelihoods for outcome are not directly comparable between the sexes. Most unknown outcomes were recorded in Zhejiang. Similarly to gender, we don’t have an equal distribution of data points across provinces either. When we look at the age distribution it is obvious that people who died tended to be slightly older than those who recovered. The density curve of unknown outcomes is more similar to that of death than of recovery, suggesting that among these people there might have been more deaths than recoveries. And lastly, I want to plot how many days passed between onset, hospitalisation and outcome for each case.

```r
ggplot(data = fluH7N9.china.2013_gather, aes(x = Date, y = age, color = outcome)) +
  geom_point(aes(shape = gender), size = 1.5, alpha = 0.6) +
  geom_path(aes(group = case.ID)) +
  facet_wrap( ~ province, ncol = 2) +
  my_theme() +
  scale_shape_manual(values = c(15, 16, 17)) +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  scale_fill_brewer(palette = "Set1") +
  labs(
    color = "Outcome", shape = "Gender",
    x = "Date in 2013", y = "Age",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Dataset from 'outbreaks' package (Kucharski et al. 2014)",
    caption = "\nTime from onset of flu to outcome."
  )
```

This plot shows that there are many missing values in the dates, so it is hard to draw a general conclusion.

In Machine Learning-speak features are the variables used for model training. Using the right features dramatically influences the accuracy of the model.

Because we don’t have many features, I am keeping age as it is, but I am also generating new features:

- from the date information I am calculating the days between onset and outcome and between onset and hospitalisation
- I am converting gender into numeric values with 1 for female and 0 for male
- similarly, I am converting provinces to binary classifiers (yes == 1, no == 0) for Shanghai, Zhejiang, Jiangsu and other provinces
- the same binary classification is given for whether a case was hospitalised, and whether they had an early onset or early outcome (earlier than the median date)
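As a minimal sketch of this encoding, here is the same logic applied to a small hypothetical data frame (two made-up cases, not the real dataset):

```r
# hypothetical two-case data frame illustrating the feature encoding above
toy <- data.frame(
  gender = c("f", "m"),
  province = c("Shanghai", "Anhui"),
  date.of.onset = as.Date(c("2013-03-01", "2013-04-10")),
  date.of.outcome = as.Date(c("2013-03-20", "2013-04-25"))
)

# gender as 1 (female) / 0 (male)
toy$gender_f <- ifelse(toy$gender == "f", 1, 0)

# binary classifier for one province
toy$province_Shanghai <- ifelse(toy$province == "Shanghai", 1, 0)

# days between onset and outcome from the date difference
toy$days_onset_to_outcome <- as.numeric(toy$date.of.outcome - toy$date.of.onset)

# early onset: onset earlier than the median onset date
toy$early_onset <- ifelse(toy$date.of.onset < median(toy$date.of.onset), 1, 0)
```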

```r
# preparing the data frame for modeling
library(dplyr)

dataset <- fluH7N9.china.2013 %>%
  mutate(
    hospital = as.factor(ifelse(is.na(date.of.hospitalisation), 0, 1)),
    gender_f = as.factor(ifelse(gender == "f", 1, 0)),
    province_Jiangsu = as.factor(ifelse(province == "Jiangsu", 1, 0)),
    province_Shanghai = as.factor(ifelse(province == "Shanghai", 1, 0)),
    province_Zhejiang = as.factor(ifelse(province == "Zhejiang", 1, 0)),
    province_other = as.factor(ifelse(province == "Zhejiang" | province == "Jiangsu" | province == "Shanghai", 0, 1)),
    days_onset_to_outcome = as.numeric(as.character(gsub(" days", "",
        as.Date(as.character(date.of.outcome), format = "%Y-%m-%d") -
        as.Date(as.character(date.of.onset), format = "%Y-%m-%d")))),
    days_onset_to_hospital = as.numeric(as.character(gsub(" days", "",
        as.Date(as.character(date.of.hospitalisation), format = "%Y-%m-%d") -
        as.Date(as.character(date.of.onset), format = "%Y-%m-%d")))),
    age = as.numeric(as.character(age)),
    early_onset = as.factor(ifelse(date.of.onset < summary(fluH7N9.china.2013$date.of.onset)[[3]], 1, 0)),
    early_outcome = as.factor(ifelse(date.of.outcome < summary(fluH7N9.china.2013$date.of.outcome)[[3]], 1, 0))
  ) %>%
  subset(select = -c(2:4, 6, 8))

rownames(dataset) <- dataset$case.ID
dataset <- dataset[, -1]
```

When looking at the dataset I created for modeling, it is obvious that we have quite a few missing values. The missing values from the outcome column are what I want to predict but for the rest I would either have to remove the entire row from the data or impute the missing information. I decided to try the latter with the mice package.

```r
# impute missing data
library(mice)
dataset_impute <- mice(dataset[, -1], print = FALSE)

# recombine imputed data frame with the outcome column
dataset_complete <- merge(dataset[, 1, drop = FALSE], mice::complete(dataset_impute, 1),
                          by = "row.names", all = TRUE)
rownames(dataset_complete) <- dataset_complete$Row.names
dataset_complete <- dataset_complete[, -1]
```

For building the model, I am separating the imputed data frame into training and test data. Test data are the 57 cases with unknown outcome.

```r
summary(dataset$outcome)
```

```
##   Death Recover    NA's 
##      32      47      57
```

The training data will be further divided for validation of the models: 70% of the training data will be kept for model building and the remaining 30% will be used for model testing. I am using the caret package for modeling.

```r
train_index <- which(is.na(dataset_complete$outcome))
train_data <- dataset_complete[-train_index, ]
test_data <- dataset_complete[train_index, -1]

library(caret)
set.seed(27)
val_index <- createDataPartition(train_data$outcome, p = 0.7, list = FALSE)
val_train_data <- train_data[val_index, ]
val_test_data <- train_data[-val_index, ]
val_train_X <- val_train_data[, -1]
val_test_X <- val_test_data[, -1]
```

To get an idea about how each feature contributes to the prediction of the outcome, I first built a decision tree based on the training data using rpart and rattle.

```r
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)

set.seed(27)
fit <- rpart(outcome ~ .,
             data = train_data,
             method = "class",
             control = rpart.control(xval = 10, minbucket = 2, cp = 0),
             parms = list(split = "information"))
fancyRpartPlot(fit)
```

This decision tree shows that cases with an early outcome were most likely to die when they were 68 or older, when they also had an early onset and when they were sick for fewer than 13 days. If a person was not among the first cases and was younger than 52, they had a good chance of recovering, but if they were 82 or older, they were more likely to die from the flu.

Not all of the features I created will be equally important to the model. The decision tree already gave me an idea of which features might be most important but I also want to estimate feature importance using a Random Forest approach with repeated cross validation.

```r
# prepare training scheme
control <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# train the model
set.seed(27)
model <- train(outcome ~ ., data = train_data, method = "rf", preProcess = NULL, trControl = control)

# estimate variable importance
importance <- varImp(model, scale = TRUE)

# prepare for plotting
importance_df_1 <- importance$importance
importance_df_1$group <- rownames(importance_df_1)
importance_df_1$group <- mapvalues(importance_df_1$group,
    from = c("age", "hospital2", "gender_f2", "province_Jiangsu2", "province_Shanghai2",
             "province_Zhejiang2", "province_other2", "days_onset_to_outcome",
             "days_onset_to_hospital", "early_onset2", "early_outcome2"),
    to = c("Age", "Hospital", "Female", "Jiangsu", "Shanghai", "Zhejiang", "Other province",
           "Days onset to outcome", "Days onset to hospital", "Early onset", "Early outcome"))

f <- importance_df_1[order(importance_df_1$Overall, decreasing = FALSE), "group"]

importance_df_2 <- importance_df_1
importance_df_2$Overall <- 0
importance_df <- rbind(importance_df_1, importance_df_2)

# setting factor levels
importance_df <- within(importance_df, group <- factor(group, levels = f))
importance_df_1 <- within(importance_df_1, group <- factor(group, levels = f))

ggplot() +
  geom_point(data = importance_df_1, aes(x = Overall, y = group, color = group), size = 2) +
  geom_path(data = importance_df, aes(x = Overall, y = group, color = group, group = group), size = 1) +
  scale_color_manual(values = rep(brewer.pal(1, "Set1")[1], 11)) +
  my_theme() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5)) +
  labs(
    x = "Importance", y = "",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Scaled feature importance",
    caption = "\nDetermined with Random Forest and repeated cross validation (10 repeats, 10 times)"
  )
```

Before I start actually building models, I want to check whether the distribution of feature values is comparable between training, validation and test datasets.

```r
# prepare for plotting
dataset_complete_gather <- dataset_complete %>%
  mutate(set = ifelse(rownames(dataset_complete) %in% rownames(test_data), "Test Data",
               ifelse(rownames(dataset_complete) %in% rownames(val_train_data), "Validation Train Data",
               ifelse(rownames(dataset_complete) %in% rownames(val_test_data), "Validation Test Data", "NA"))),
         case_ID = rownames(.)) %>%
  gather(group, value, age:early_outcome)

dataset_complete_gather$group <- mapvalues(dataset_complete_gather$group,
    from = c("age", "hospital", "gender_f", "province_Jiangsu", "province_Shanghai",
             "province_Zhejiang", "province_other", "days_onset_to_outcome",
             "days_onset_to_hospital", "early_onset", "early_outcome"),
    to = c("Age", "Hospital", "Female", "Jiangsu", "Shanghai", "Zhejiang", "Other province",
           "Days onset to outcome", "Days onset to hospital", "Early onset", "Early outcome"))

ggplot(data = dataset_complete_gather, aes(x = as.numeric(value), fill = outcome, color = outcome)) +
  geom_density(alpha = 0.2) +
  geom_rug() +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  scale_fill_brewer(palette = "Set1", na.value = "grey50") +
  my_theme() +
  facet_wrap(set ~ group, ncol = 11, scales = "free") +
  labs(
    x = "Value", y = "Density",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Features for classifying outcome",
    caption = "\nDensity distribution of all features used for classification of flu outcome."
  )
```

```r
ggplot(subset(dataset_complete_gather,
              group == "Age" | group == "Days onset to hospital" | group == "Days onset to outcome"),
       aes(x = outcome, y = as.numeric(value), fill = set)) +
  geom_boxplot() +
  my_theme() +
  scale_fill_brewer(palette = "Set1", type = "div") +
  facet_wrap( ~ group, ncol = 3, scales = "free") +
  labs(
    fill = "", x = "Outcome", y = "Value",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Features for classifying outcome",
    caption = "\nBoxplot of the features age, days from onset to hospitalisation and days from onset to outcome."
  )
```

Luckily, the distributions look reasonably similar between the validation and test data, except for a few outliers.

Before I try to predict the outcome of the unknown cases, I am testing the models’ accuracy on the validation datasets with a couple of algorithms. I have chosen only a few well-known algorithms, but caret implements many more. I chose not to do any preprocessing because I was worried that the different data distributions of continuous variables (e.g. age) and binary variables (i.e. 0/1 classification of e.g. hospitalisation) would lead to problems.

Random Forests predictions are based on the generation of multiple classification trees. This model classified 14 out of 23 cases correctly.

```r
set.seed(27)
model_rf <- caret::train(outcome ~ .,
                         data = val_train_data,
                         method = "rf",
                         preProcess = NULL,
                         trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_rf, val_test_data[, -1]), val_test_data$outcome)
```

Lasso and elastic net regularization for generalized linear models (glmnet) is based on linear regression and is useful when we have correlated features in our model. This model classified 13 out of 23 cases correctly.

```r
set.seed(27)
model_glmnet <- caret::train(outcome ~ .,
                             data = val_train_data,
                             method = "glmnet",
                             preProcess = NULL,
                             trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_glmnet, val_test_data[, -1]), val_test_data$outcome)
```

Weighted k-nearest neighbors classifies each case based on the classes of its most similar cases in the training data.

```r
set.seed(27)
model_kknn <- caret::train(outcome ~ .,
                           data = val_train_data,
                           method = "kknn",
                           preProcess = NULL,
                           trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_kknn, val_test_data[, -1]), val_test_data$outcome)
```

Penalized discriminant analysis is a regularized version of discriminant analysis.

```r
set.seed(27)
model_pda <- caret::train(outcome ~ .,
                          data = val_train_data,
                          method = "pda",
                          preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_pda, val_test_data[, -1]), val_test_data$outcome)
```

Stabilized Linear Discriminant Analysis is designed for high-dimensional data and correlated co-variables. This model classified 15 out of 23 cases correctly.

```r
set.seed(27)
model_slda <- caret::train(outcome ~ .,
                           data = val_train_data,
                           method = "slda",
                           preProcess = NULL,
                           trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_slda, val_test_data[, -1]), val_test_data$outcome)
```

Nearest Shrunken Centroids computes a standardized centroid for each class and shrinks each centroid toward the overall centroid for all classes. This model classified 15 out of 23 cases correctly.

```r
set.seed(27)
model_pam <- caret::train(outcome ~ .,
                          data = val_train_data,
                          method = "pam",
                          preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_pam, val_test_data[, -1]), val_test_data$outcome)
```

C5.0 is another tree-based modeling algorithm. This model classified 15 out of 23 cases correctly.

```r
set.seed(27)
model_C5.0Tree <- caret::train(outcome ~ .,
                               data = val_train_data,
                               method = "C5.0Tree",
                               preProcess = NULL,
                               trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_C5.0Tree, val_test_data[, -1]), val_test_data$outcome)
```

Partial least squares regression combines principal component analysis and multiple regression and is useful for modeling with correlated features. This model classified 15 out of 23 cases correctly.

```r
set.seed(27)
model_pls <- caret::train(outcome ~ .,
                          data = val_train_data,
                          method = "pls",
                          preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_pls, val_test_data[, -1]), val_test_data$outcome)
```

All models were similarly accurate.

```r
# Create a list of models
models <- list(rf = model_rf, glmnet = model_glmnet, kknn = model_kknn, pda = model_pda,
               slda = model_slda, pam = model_pam, C5.0Tree = model_C5.0Tree, pls = model_pls)

# Resample the models
resample_results <- resamples(models)

# Generate a summary
summary(resample_results, metric = c("Kappa", "Accuracy"))
bwplot(resample_results, metric = c("Kappa", "Accuracy"))
```

To compare the predictions from all models, I summed up the prediction probabilities for Death and Recovery from all models and calculated the log2 ratio of the summed probabilities for Recovery to the summed probabilities for Death. All cases with a log2 ratio greater than 1.5 were defined as Recover, cases with a log2 ratio below -1.5 were defined as Death, and the remaining cases were defined as uncertain.
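As a toy illustration of this voting scheme (using made-up prediction probabilities from two hypothetical models, not the real model output):

```r
# hypothetical prediction probabilities from two models for three cases
probs <- data.frame(
  m1.Death = c(0.9, 0.1, 0.5), m1.Recover = c(0.1, 0.9, 0.5),
  m2.Death = c(0.8, 0.2, 0.6), m2.Recover = c(0.2, 0.8, 0.4)
)

# sum probabilities per outcome across models
sum_Death   <- rowSums(probs[, grep("Death", names(probs))])
sum_Recover <- rowSums(probs[, grep("Recover", names(probs))])

# log2 ratio of Recovery to Death; thresholds at +/- 1.5
log2_ratio <- log2(sum_Recover / sum_Death)
pred <- ifelse(log2_ratio > 1.5, "Recover",
        ifelse(log2_ratio < -1.5, "Death", "uncertain"))
```

Case 3, where the two models disagree, ends up flagged as uncertain; the clear-cut cases 1 and 2 get a definite label.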

```r
results <- data.frame(
  randomForest = predict(model_rf, newdata = val_test_data[, -1], type = "prob"),
  glmnet = predict(model_glmnet, newdata = val_test_data[, -1], type = "prob"),
  kknn = predict(model_kknn, newdata = val_test_data[, -1], type = "prob"),
  pda = predict(model_pda, newdata = val_test_data[, -1], type = "prob"),
  slda = predict(model_slda, newdata = val_test_data[, -1], type = "prob"),
  pam = predict(model_pam, newdata = val_test_data[, -1], type = "prob"),
  C5.0Tree = predict(model_C5.0Tree, newdata = val_test_data[, -1], type = "prob"),
  pls = predict(model_pls, newdata = val_test_data[, -1], type = "prob")
)

results$sum_Death <- rowSums(results[, grep("Death", colnames(results))])
results$sum_Recover <- rowSums(results[, grep("Recover", colnames(results))])
results$log2_ratio <- log2(results$sum_Recover / results$sum_Death)
results$true_outcome <- val_test_data$outcome
results$pred_outcome <- ifelse(results$log2_ratio > 1.5, "Recover",
                        ifelse(results$log2_ratio < -1.5, "Death", "uncertain"))
results$prediction <- ifelse(results$pred_outcome == results$true_outcome, "CORRECT",
                      ifelse(results$pred_outcome == "uncertain", "uncertain", "wrong"))
```

All predictions based on all models were either correct or uncertain.

The above models will now be used to predict the outcome of cases with unknown fate.

```r
set.seed(27)
model_rf <- caret::train(outcome ~ ., data = train_data, method = "rf", preProcess = NULL,
                         trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_glmnet <- caret::train(outcome ~ ., data = train_data, method = "glmnet", preProcess = NULL,
                             trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_kknn <- caret::train(outcome ~ ., data = train_data, method = "kknn", preProcess = NULL,
                           trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_pda <- caret::train(outcome ~ ., data = train_data, method = "pda", preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_slda <- caret::train(outcome ~ ., data = train_data, method = "slda", preProcess = NULL,
                           trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_pam <- caret::train(outcome ~ ., data = train_data, method = "pam", preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_C5.0Tree <- caret::train(outcome ~ ., data = train_data, method = "C5.0Tree", preProcess = NULL,
                               trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_pls <- caret::train(outcome ~ ., data = train_data, method = "pls", preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))

models <- list(rf = model_rf, glmnet = model_glmnet, kknn = model_kknn, pda = model_pda,
               slda = model_slda, pam = model_pam, C5.0Tree = model_C5.0Tree, pls = model_pls)

# Resample the models
resample_results <- resamples(models)
bwplot(resample_results, metric = c("Kappa", "Accuracy"))
```

Here again, the accuracy is similar in all models. The final results are calculated as described above.

```r
results <- data.frame(
  randomForest = predict(model_rf, newdata = test_data, type = "prob"),
  glmnet = predict(model_glmnet, newdata = test_data, type = "prob"),
  kknn = predict(model_kknn, newdata = test_data, type = "prob"),
  pda = predict(model_pda, newdata = test_data, type = "prob"),
  slda = predict(model_slda, newdata = test_data, type = "prob"),
  pam = predict(model_pam, newdata = test_data, type = "prob"),
  C5.0Tree = predict(model_C5.0Tree, newdata = test_data, type = "prob"),
  pls = predict(model_pls, newdata = test_data, type = "prob")
)

results$sum_Death <- rowSums(results[, grep("Death", colnames(results))])
results$sum_Recover <- rowSums(results[, grep("Recover", colnames(results))])
results$log2_ratio <- log2(results$sum_Recover / results$sum_Death)
results$predicted_outcome <- ifelse(results$log2_ratio > 1.5, "Recover",
                             ifelse(results$log2_ratio < -1.5, "Death", "uncertain"))
```

From 57 cases, 14 were defined as Recover, 10 as Death and 33 as uncertain.

```r
results_combined <- merge(results[, -c(1:16)],
                          fluH7N9.china.2013[which(fluH7N9.china.2013$case.ID %in% rownames(results)), ],
                          by.x = "row.names", by.y = "case.ID")
results_combined <- results_combined[, -c(2, 3, 8, 9)]

results_combined_gather <- results_combined %>%
  gather(group_dates, date, date.of.onset:date.of.hospitalisation)

results_combined_gather$group_dates <- factor(results_combined_gather$group_dates,
    levels = c("date.of.onset", "date.of.hospitalisation"))
results_combined_gather$group_dates <- mapvalues(results_combined_gather$group_dates,
    from = c("date.of.onset", "date.of.hospitalisation"),
    to = c("Date of onset", "Date of hospitalisation"))

results_combined_gather$gender <- mapvalues(results_combined_gather$gender,
    from = c("f", "m"), to = c("Female", "Male"))
levels(results_combined_gather$gender) <- c(levels(results_combined_gather$gender), "unknown")
results_combined_gather$gender[is.na(results_combined_gather$gender)] <- "unknown"

results_combined_gather$age <- as.numeric(as.character(results_combined_gather$age))

ggplot(data = results_combined_gather, aes(x = date, y = log2_ratio, color = predicted_outcome)) +
  geom_jitter(aes(size = age), alpha = 0.3) +
  geom_rug() +
  facet_grid(gender ~ group_dates) +
  labs(
    color = "Predicted outcome", size = "Age",
    x = "Date in 2013", y = "log2 ratio of prediction Recover vs Death",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Predicted outcome", caption = ""
  ) +
  my_theme() +
  scale_color_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1")
```

The comparison of date of onset, date of hospitalisation, gender and age with predicted outcome shows that predicted deaths were associated with older age than predicted recoveries. Date of onset does not show an obvious bias in either direction.

This dataset posed a couple of difficulties to begin with, like the unequal distribution of data points across variables and missing data. This makes the modeling inherently prone to flaws. However, real-life data isn’t perfect either, so I went ahead and tested the modeling success anyway. By accounting for uncertain classifications with low prediction probability, the validation data could be classified accurately. Still, these few cases don’t contain enough information to build a truly reliable model of outcome. More cases, more information (i.e. more features) and fewer missing data would improve the modeling outcome. Also, this example is only applicable to this specific flu outbreak. In order to draw more general conclusions about flu outcome, other cases and additional information, for example medical parameters like preexisting medical conditions, disease parameters, demographic information, etc., would be necessary. All in all, this dataset served as a nice example of the possibilities (and pitfalls) of machine learning applications and showcases a basic workflow for building prediction models with R.

If you have questions feel free to post a comment.


The example that I will use throughout this post is the logistic growth function, which is often used in ecology to model population growth. For instance, say you count the number of bacteria cells in a petri dish; in the beginning the cell counts will increase exponentially, but after some time, due to limits in resources (be it space or food), the bacteria population will reach an equilibrium. This produces the classical S-shaped, non-linear, logistic growth curve. The logistic growth function has three parameters: the growth rate called “r”, the population size at equilibrium called “K” and the population size at the beginning called “n0”.
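Before fitting anything, the function itself and its limiting behavior can be sketched directly (this is the same logistic equation used in the model below; the parameter values here are arbitrary):

```r
# logistic growth: population size d(t) with growth rate r,
# carrying capacity K and initial population size n0
logF <- function(time, K, n0, r){
  K * n0 * exp(r * time) / (K + n0 * (exp(r * time) - 1))
}

logF(0, K = 100, n0 = 5, r = 0.2)    # at t = 0 the function returns n0 (5)
logF(500, K = 100, n0 = 5, r = 0.2)  # for large t it approaches K (100)
```

These two limits are what make r, K and n0 interpretable: n0 anchors the curve at the start, K is the plateau, and r controls how fast the S-shape rises between them.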

Say that we want to test the effect of food quality on the population dynamics of our bacteria. To do so we use three different types of food and we count the number of bacteria over time. Here is how to code this in R using the function `nlsList` from the package nlme:

```r
# load libraries
library(nlme)

# first try: effect of treatment on logistic growth
Ks <- c(100, 200, 150)
n0 <- c(5, 5, 6)
r <- c(0.15, 0.2, 0.15)
time <- 1:50

# this function returns population dynamics following a logistic curve
logF <- function(time, K, n0, r){
  d <- K * n0 * exp(r * time) / (K + n0 * (exp(r * time) - 1))
  return(d)
}

# simulate some data
dat <- data.frame(Treatment = character(), Time = numeric(), Abundance = numeric())
for(i in 1:3){
  Ab <- logF(time = time, K = Ks[i], n0 = n0[i], r = r[i])
  # note that random deviates were added to the simulated population density values
  tmp <- data.frame(Treatment = paste0("T", i), Time = time, Abundance = Ab + rnorm(time, 0, 5))
  dat <- rbind(dat, tmp)
}

# the formula for the models
lF <- formula(Abundance ~ K * n0 * exp(r * Time) / (K + n0 * (exp(r * Time) - 1)) | Treatment)

# fit the model
(m <- nlsList(lF, data = dat, start = list(K = 150, n0 = 10, r = 0.5)))
```

```
## Call:
##   Model: Abundance ~ K * n0 * exp(r * Time)/(K + n0 * (exp(r * Time) - 1)) | Treatment
##    Data: dat
##
## Coefficients:
##           K       n0         r
## T1 105.2532 4.996502 0.1470740
## T2 199.5149 4.841359 0.2006150
## T3 150.2882 5.065196 0.1595293
##
## Degrees of freedom: 150 total; 141 residual
## Residual standard error: 5.085229
```

The code simulated population values using three sets of parameters (the r, K and n0’s). Then we specified the non-linear regression formula, using the pipe “|” symbol to explicitly ask for fitting different parameters to each Treatment. The model output gives us the estimated parameters for each Treatment. We can now plot the model predictions against the real data:

```r
library(ggplot2)  # needed for the plot below

# derive the predicted lines
dat$Pred <- predict(m)
# plot
ggplot(dat, aes(x = Time, y = Abundance, color = Treatment)) +
  geom_point() + geom_line(aes(y = Pred))
```

We can also model the effect of continuous predictors on the parameters of the non-linear regression. For instance, we might assume that bacterial colonies grow faster at warmer temperatures. In this case we could model the effect of temperature on the growth rate (assuming that it is linear) and use the fitted growth rate to model the number of bacteria, all within one model. Pretty neat.

To do so I will use another package, bbmle, and its function `mle2`:

```r
# load libraries
library(bbmle)
library(reshape2)

# parameters for simulation
K <- 200
n0 <- 5
# the gradient in temperature
Temp <- ceiling(seq(0, 20, length = 10))

# simulate some data: sample from a Poisson distribution with lambda
# parameter equal to the expected value; note that we do not provide one
# value for the parameter r but a vector of values representing the
# effect of temperature on r
mm <- sapply(Temp, function(x){
  rpois(50, logF(time = 1:50, K = K, n0 = n0, r = 0.05 + 0.01 * x))
})

# some reshaping
datT <- melt(mm)
names(datT) <- c("Time", "Temp", "Abund")
datT$Temp <- Temp[datT$Temp]

# fit the model
(mll <- mle2(Abund ~ dpois(K * n0 * exp(r * Time) / (K + n0 * (exp(r * Time) - 1))),
             data = datT, parameters = list(r ~ Temp),
             start = list(K = 100, n0 = 10, r = 0.05)))
```

```
Call:
mle2(minuslogl = Abund ~ dpois(K * n0 * exp(r * Time)/(K + n0 *
    (exp(r * Time) - 1))), start = list(K = 100, n0 = 10, r = 0.05),
    data = datT, parameters = list(r ~ Temp))

Coefficients:
            K            n0 r.(Intercept)        r.Temp
 200.13176811    4.60546966    0.05195730    0.01016994

Log-likelihood: -1744.01
```

The first argument for `mle2` is either a function returning the negative log-likelihood or a formula of the form “response ~ some distribution(expected value)”. In our case we fit a Poisson distribution with expected values coming from the logistic equation and its three parameters. The dependency between the growth rate and the temperature is given using the “parameters” argument. The basic model output gives us all the estimated parameters; of interest here are “r.(Intercept)”, which is the growth rate when the temperature is 0, and “r.Temp”, which is the increase in the growth rate for every increase of temperature by 1.

Again we can get the fitted values for plotting:

```r
datT$pred <- predict(mll)
ggplot(datT, aes(x = Time, y = Abund, color = factor(Temp))) +
  geom_point() + geom_line(aes(y = pred))
```

The cool thing with `mle2` is that you can fit any model that you can imagine, as long as you are able to write down the log-likelihood function. We will see this with an extension of the previous model. It is a common assumption in biology that species have some optimum temperature, hence we can expect a bell-shaped relation between population equilibrium and temperature. So we will add to the previous model a quadratic relation between the population equilibrium (“K”) and temperature.

```r
# simulate some data; note that this time both K and r are given as
# vectors to represent their relation with temperature
mm <- sapply(Temp, function(x){
  rpois(50, logF(time = 1:50, K = 100 + 20 * x - x**2, n0 = n0, r = 0.05 + 0.01 * x))
})

# some reshaping
datT <- melt(mm)
names(datT) <- c("Time", "Temp", "Abund")
datT$Temp <- Temp[datT$Temp]

# the negative log-likelihood function
LL <- function(k0, k1, k2, n0, r0, r1){
  K <- k0 + k1 * Temp + k2 * Temp**2
  r <- r0 + r1 * Temp
  lbd <- K * n0 * exp(r * Time) / (K + n0 * (exp(r * Time) - 1))
  -sum(dpois(Abund, lbd, log = TRUE))
}

(mll2 <- mle2(LL, data = datT,
              start = list(k0 = 100, k1 = 1, k2 = -0.1, n0 = 10, r0 = 0.1, r1 = 0.1)))
```

```
Call:
mle2(minuslogl = LL, start = list(k0 = 100, k1 = 1, k2 = -0.1,
    n0 = 10, r0 = 0.1, r1 = 0.1), data = datT)

Coefficients:
          k0           k1           k2           n0           r0           r1
100.47610915  20.09677780  -1.00836974   4.84940342   0.04960558   0.01018838

Log-likelihood: -1700.91
```

This time I defined the negative log-likelihood function to fit the model. This function takes the model parameters as arguments. In this case we have 3 parameters to describe the quadratic relation between K and temperature, 1 parameter for the constant n0 and 2 parameters to describe the linear relation between r and temperature. The important thing to understand is what the last line is doing. Basically the `dpois` call returns the probabilities of getting the observed abundance values (“Abund”) given the expected mean values (“lbd”). The `LL` function then returns a single negative log-likelihood value (the sum), and the job of `mle2` is to minimize this value. To do so, `mle2` tries many different parameter combinations, each time calling the LL function, and when it reaches the minimum it returns the parameter set used. This is Maximum Likelihood Estimation (MLE). As before, the model output gives us the estimated parameters, and we can use these to plot the fitted regressions:

```r
cc <- coef(mll2)
# sorry for the awful coding, to get the model predicted values we need
# to code the relation by hand which is rather clumsy ...
datT$pred <- melt(sapply(Temp, function(x) {
  logF(time = 1:50, K = cc[1] + cc[2] * x + cc[3] * x**2,
       n0 = cc[4], r = cc[5] + cc[6] * x)
}))[, 3]
ggplot(datT, aes(x = Time, y = Abund, color = factor(Temp))) +
  geom_point() + geom_line(aes(y = pred))
```
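The minimisation that `mle2` performs can be illustrated with a minimal, self-contained sketch using base R's `optimize` on a one-parameter Poisson model (toy data, not the simulation above):

```r
set.seed(42)
# toy data: Poisson counts with a known mean of 7
y <- rpois(1000, lambda = 7)

# negative log-likelihood as a function of log(lambda)
nll <- function(log_lambda) {
  -sum(dpois(y, exp(log_lambda), log = TRUE))
}

# minimise the negative log-likelihood -- the same job mle2 does,
# here with a one-dimensional optimiser
fit <- optimize(nll, interval = c(-5, 5))
exp(fit$minimum)  # the MLE, which for a Poisson is the sample mean
```

`mle2` does essentially this, but over several parameters at once and with conveniences such as standard errors and likelihood profiles.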

I showed three ways by which you can include the effect of predictors on the parameters of a non-linear regression: nlsList, mle2 with a formula, and mle2 with a customized negative log-likelihood function. The first option is super easy and might be enough in many cases, while the last option, though more complex, gives you (almost) infinite freedom in the models that you can fit.

Happy non-linear modelling!


```r
all.equal(weather_data$Rainfall > 1, weather_data$RainToday == "Yes")
```

As a consequence, so far we are able to predict whether tomorrow's rainfall will be above 1 mm or not. Some more questions to be answered:

- In case of “at least moderate” Rainfall (arbitrarily defined as Rainfall > 15mm), how do our models perform?

In case of “at least moderate” rainfall, we would like to be as reliable as possible in predicting {RainTomorrow = “Yes”}. Since RainTomorrow = “Yes” is perceived as the prediction of a potential threat of damage due to the rainfall, we have to alert Canberra's citizens properly. That translates into having a very good specificity, as explained in the remainder of the analysis.
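As a reminder of the metric involved, specificity is the fraction of negative cases correctly predicted as negative. A toy computation with hypothetical labels (not the weather data):

```r
# specificity = true negatives / (true negatives + false positives)
actual    <- c("No", "No", "No",  "No", "Yes", "Yes", "Yes", "No")
predicted <- c("No", "No", "Yes", "No", "Yes", "Yes", "No",  "No")

tn <- sum(predicted == "No"  & actual == "No")
fp <- sum(predicted == "Yes" & actual == "No")
tn / (tn + fp)  # 0.8
```

Note that in the caret confusion matrices shown later, the “positive” class is “No”, so the reported specificity measures how many of the “Yes” (rainy) cases are captured.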

- Can we build additional models in order to predict further variables such as tomorrow’s Rainfall, Humidity3pm, WindGustSpeed, Sunshine, MinTemp and MaxTemp variables?

That is motivated by the fact that weather forecast comprises more than one prediction. For an example of the most typical ones, see:

Since Rainfall was one of the variables we dropped from our original dataset, we have to bring it back and identify the records associated with moderate rainfall. The following note applies.

In the remainder of the analysis, we will not assume that we have to implement separate 9AM, 3PM and late-evening predictions as done in part #2. For simplicity and brevity, we assume that we address these new goals in the late evening, when all predictors are available. As before, we will not take into account the following variables:

Date, Location, RISK_MM, RainToday, Rainfall, WindDir9am, WindDir3pm

as Date and Location cannot be used as predictors; RISK_MM, RainToday and Rainfall are excluded to make our task a little bit more difficult; and we showed in the first tutorial of this series that WindDir9am and WindDir3pm are not interesting predictors and have a considerable number of associated NA values.

Having said that, we then start the moderate rainfall scenario analysis. At the same time, we prepare the dataset for the rest of the analysis.

```r
weather_data6 <- subset(weather_data, select = -c(Date, Location, RISK_MM, RainToday, WindDir9am, WindDir3pm))
weather_data6$RainfallTomorrow <- c(weather_data6$Rainfall[2:nrow(weather_data6)], NA)
weather_data6$Humidity3pmTomorrow <- c(weather_data6$Humidity3pm[2:nrow(weather_data6)], NA)
weather_data6$WindGustSpeedTomorrow <- c(weather_data6$WindGustSpeed[2:nrow(weather_data6)], NA)
weather_data6$SunshineTomorrow <- c(weather_data6$Sunshine[2:nrow(weather_data6)], NA)
weather_data6$MinTempTomorrow <- c(weather_data6$MinTemp[2:nrow(weather_data6)], NA)
weather_data6$MaxTempTomorrow <- c(weather_data6$MaxTemp[2:nrow(weather_data6)], NA)
weather_data7 = weather_data6[complete.cases(weather_data6),]
head(weather_data7)
```

```
  MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindSpeed9am WindSpeed3pm
1     8.0    24.3      0.0         3.4      6.3          NW            30            6           20
2    14.0    26.9      3.6         4.4      9.7         ENE            39            4           17
3    13.7    23.4      3.6         5.8      3.3          NW            85            6            6
4    13.3    15.5     39.8         7.2      9.1          NW            54           30           24
5     7.6    16.1      2.8         5.6     10.6         SSE            50           20           28
6     6.2    16.9      0.0         5.8      8.2          SE            44           20           24
  Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainTomorrow
1          68          29      1019.7      1015.0        7        7    14.4    23.6          Yes
2          80          36      1012.4      1008.4        5        3    17.5    25.7          Yes
3          82          69      1009.5      1007.2        8        7    15.4    20.2          Yes
4          62          56      1005.5      1007.0        2        7    13.5    14.1          Yes
5          68          49      1018.3      1018.5        7        7    11.1    15.4           No
6          70          57      1023.8      1021.7        7        5    10.9    14.8           No
  RainfallTomorrow Humidity3pmTomorrow WindGustSpeedTomorrow SunshineTomorrow MinTempTomorrow
1              3.6                  36                    39              9.7            14.0
2              3.6                  69                    85              3.3            13.7
3             39.8                  56                    54              9.1            13.3
4              2.8                  49                    50             10.6             7.6
5              0.0                  57                    44              8.2             6.2
6              0.2                  47                    43              8.4             6.1
  MaxTempTomorrow
1            26.9
2            23.4
3            15.5
4            16.1
5            16.9
6            18.2
```

```r
hr_idx = which(weather_data7$RainfallTomorrow > 15)
```

It is important to understand how the moderate rainfall records are split between the training and testing sets, as shown here.

```r
(train_hr <- hr_idx[hr_idx %in% train_rec])
```

```
[1]   9  22  30  52  79 143 311
```

```r
(test_hr <- hr_idx[!(hr_idx %in% train_rec)])
```

```
[1]   3  99 239 251
```

We see that some of the “at least moderate” Rainfall records belong to the testing dataset, hence we can use them to obtain a measure based on data unseen by our model. We test the evening models with a test set comprising such moderate rainfall records.

```r
rain_test <- weather_data7[test_hr,]
rain_test
```

```
    MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindSpeed9am WindSpeed3pm
3      13.7    23.4      3.6         5.8      3.3          NW            85            6            6
99     14.5    24.2      0.0         6.8      5.9         SSW            61           11           20
250     3.0    11.1      0.8         1.4      0.2           W            35            0           13
263    -1.1    11.0      0.2         1.8      0.0         WNW            41            0            6
    Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainTomorrow
3            82          69      1009.5      1007.2        8        7    15.4    20.2          Yes
99           76          76       999.4       998.9        7        7    17.9    20.3          Yes
250          99          96      1024.4      1021.1        7        8     6.3     8.6          Yes
263          92          87      1014.4      1009.0        7        8     2.4     8.7          Yes
    RainfallTomorrow Humidity3pmTomorrow WindGustSpeedTomorrow SunshineTomorrow MinTempTomorrow
3               39.8                  56                    54              9.1            13.3
99              16.2                  58                    41              5.6            12.4
250             16.8                  72                    35              6.5             2.9
263             19.2                  56                    54              7.5             2.3
    MaxTempTomorrow
3              15.5
99             19.9
250             9.5
263            11.6
```

Let us see how the first weather forecast evening model performs.

```r
opt_cutoff <- 0.42
pred_test <- predict(mod_ev_c2_fit, rain_test, type="prob")
prediction <- ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No")
data.frame(prediction = prediction, RainfallTomorrow = rain_test$RainfallTomorrow)
```

```
  prediction RainfallTomorrow
1        Yes             39.8
2         No             16.2
3        Yes             16.8
4        Yes             19.2
```

```r
confusionMatrix(prediction, rain_test$RainTomorrow)
```

```
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No   0   1
       Yes  0   3

               Accuracy : 0.75
                 95% CI : (0.1941, 0.9937)
    No Information Rate : 1
    P-Value [Acc > NIR] : 1

                  Kappa : 0
 Mcnemar's Test P-Value : 1

            Sensitivity : NA
            Specificity : 0.75
         Pos Pred Value : NA
         Neg Pred Value : NA
             Prevalence : 0.00
         Detection Rate : 0.00
   Detection Prevalence : 0.25
      Balanced Accuracy : NA

       'Positive' Class : No
```

Then, the second evening weather forecast model.

```r
opt_cutoff <- 0.56
pred_test <- predict(mod_ev_c3_fit, rain_test, type="prob")
prediction <- ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No")
data.frame(prediction = prediction, RainfallTomorrow = rain_test$RainfallTomorrow)
```

```
  prediction RainfallTomorrow
1        Yes             39.8
2        Yes             16.2
3         No             16.8
4        Yes             19.2
```

```r
confusionMatrix(prediction, rain_test$RainTomorrow)
```

```
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No   0   1
       Yes  0   3

               Accuracy : 0.75
                 95% CI : (0.1941, 0.9937)
    No Information Rate : 1
    P-Value [Acc > NIR] : 1

                  Kappa : 0
 Mcnemar's Test P-Value : 1

            Sensitivity : NA
            Specificity : 0.75
         Pos Pred Value : NA
         Neg Pred Value : NA
             Prevalence : 0.00
         Detection Rate : 0.00
   Detection Prevalence : 0.25
      Balanced Accuracy : NA

       'Positive' Class : No
```

For both evening forecast models, one of the testing set predictions is wrong. If we would like to improve this, we have to step back to the tuning procedure and determine a decision threshold more suitable to capture such scenarios. From the tables shown in the previous part, we can try to lower the cutoff value to increase specificity, though likely at the cost of reduced accuracy.
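The effect of the cutoff can be explored with a quick sweep. The sketch below uses hypothetical predicted probabilities and true labels, standing in for `pred_test$Yes` and `rain_test$RainTomorrow`; it shows how lowering the cutoff captures more of the rainy days:

```r
# hypothetical predicted probabilities of "Yes" and the true labels
probs <- c(0.9, 0.55, 0.48, 0.35, 0.2, 0.6, 0.3, 0.7)
truth <- c("Yes", "Yes", "Yes", "No", "No", "Yes", "No", "Yes")

# fraction of actual rainy days flagged as "Yes" at a given cutoff
rainy_days_caught <- function(cutoff) {
  pred <- ifelse(probs >= cutoff, "Yes", "No")
  sum(pred == "Yes" & truth == "Yes") / sum(truth == "Yes")
}

sapply(c(0.56, 0.50, 0.42), rainy_days_caught)  # lower cutoffs catch more rainy days
```

In a real tuning step the cutoff would be chosen on the validation data, trading this gain against the false alarms introduced on dry days.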

In the previous part of this series, when discussing the tuning of the decision threshold, we dealt with probabilities associated with the predicted RainTomorrow variable. The chances of having RainTomorrow == “Yes” are equivalent to the chance of rain. Hence the following utility function can be defined for this purpose.

```r
chance_of_rain <- function(model, data_record){
  # note: use the model passed in as argument, not a hard-coded fit
  chance_frac <- predict(model, data_record, type="prob")[, "Yes"]
  paste(round(chance_frac*100), "%", sep="")
}
```

We try it out by passing ten records of the testing dataset.

```r
chance_of_rain(mod_ev_c3_fit, testing[1:10,])
```

```
 [1] "79%" "3%"  "4%"  "3%"  "1%"  "7%"  "69%" "61%" "75%" "78%"
```

To build all the following models, we are going to use all the available data in order to capture the variability of an entire year. For brevity, we do not compare models for the same predicted variable, nor do we show regression diagnostic plots.

If the logistic regression model predicts RainTomorrow = “Yes”, we would like to take advantage of a linear regression model capable of predicting tomorrow's Rainfall value. In other words, we are only interested in records whose Rainfall outcome is greater than 1 mm.

```r
weather_data8 = weather_data7[weather_data7$RainfallTomorrow > 1,]
rf_fit <- lm(RainfallTomorrow ~ MaxTemp + Sunshine + WindGustSpeed - 1, data = weather_data8)
summary(rf_fit)
```

```
Call:
lm(formula = RainfallTomorrow ~ MaxTemp + Sunshine + WindGustSpeed -
    1, data = weather_data8)

Residuals:
    Min      1Q  Median      3Q     Max
-10.363  -4.029  -1.391   3.696  23.627

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
MaxTemp         0.3249     0.1017   3.196 0.002225 **
Sunshine       -1.2515     0.2764  -4.528 2.88e-05 ***
WindGustSpeed   0.1494     0.0398   3.755 0.000394 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.395 on 60 degrees of freedom
Multiple R-squared:  0.6511, Adjusted R-squared:  0.6336
F-statistic: 37.32 on 3 and 60 DF,  p-value: 9.764e-14
```

All predictors are reported as significant.

```r
lm_pred <- predict(rf_fit, weather_data8)
plot(x = seq_along(weather_data8$RainfallTomorrow), y = weather_data8$RainfallTomorrow,
     type='p', xlab = "observations", ylab = "RainfallTomorrow")
legend("topright", c("actual", "predicted"), fill = c("black", "red"))
points(x = seq_along(weather_data8$RainfallTomorrow), y = lm_pred, col='red')
```

The way we intend to use this model together with the logistic regression model predicting the RainTomorrow factor variable is the following.

```
+----------------------+                              +--------------------+
| logistic regression  |                              | linear regression  |
| model                +--> {RainTomorrow = Yes} -->--+ model for          +----> {RainfallTomorrow prediction}
|                      |                              | RainfallTomorrow   |
+--------+-------------+                              +--------------------+
         |
         +---------------- {RainTomorrow = No} -->-- {Rainfall < 1 mm}
```

```r
h3pm_fit <- lm(Humidity3pmTomorrow ~ Humidity3pm + Sunshine, data = weather_data7)
summary(h3pm_fit)
```

```
Call:
lm(formula = Humidity3pmTomorrow ~ Humidity3pm + Sunshine, data = weather_data7)

Residuals:
    Min      1Q  Median      3Q     Max
-30.189  -9.117  -2.417   7.367  45.725

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.18668    5.45346   6.269 1.09e-09 ***
Humidity3pm  0.37697    0.06973   5.406 1.21e-07 ***
Sunshine    -0.85027    0.33476  -2.540   0.0115 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.26 on 344 degrees of freedom
Multiple R-squared:  0.2803, Adjusted R-squared:  0.2761
F-statistic: 66.99 on 2 and 344 DF,  p-value: < 2.2e-16
```

All predictors are reported as significant.

```r
lm_pred <- predict(h3pm_fit, weather_data7)
plot(x = seq_along(weather_data7$Humidity3pmTomorrow), y = weather_data7$Humidity3pmTomorrow,
     type='p', xlab = "observations", ylab = "Humidity3pmTomorrow")
legend("topright", c("actual", "predicted"), fill = c("black", "red"))
points(x = seq_along(weather_data7$Humidity3pmTomorrow), y = lm_pred, col='red')
```

```r
wgs_fit <- lm(WindGustSpeedTomorrow ~ WindGustSpeed + Pressure9am + Pressure3pm, data = weather_data7)
summary(wgs_fit)
```

```
Call:
lm(formula = WindGustSpeedTomorrow ~ Pressure9am + Pressure3pm +
    WindGustSpeed, data = weather_data7)

Residuals:
    Min      1Q  Median      3Q     Max
-31.252  -7.603  -1.069   6.155  49.984

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   620.53939  114.83812   5.404 1.22e-07 ***
Pressure9am     2.17983    0.36279   6.009 4.78e-09 ***
Pressure3pm    -2.76596    0.37222  -7.431 8.66e-13 ***
WindGustSpeed   0.23365    0.05574   4.192 3.52e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.29 on 343 degrees of freedom
Multiple R-squared:  0.2628, Adjusted R-squared:  0.2563
F-statistic: 40.75 on 3 and 343 DF,  p-value: < 2.2e-16
```

All predictors are reported as significant.

```r
lm_pred <- predict(wgs_fit, weather_data7)
plot(x = seq_along(weather_data7$WindGustSpeedTomorrow), y = weather_data7$WindGustSpeedTomorrow,
     type='p', xlab = "observations", ylab = "WindGustSpeedTomorrow")
legend("topright", c("actual", "predicted"), fill = c("black", "red"))
points(x = seq_along(weather_data7$WindGustSpeedTomorrow), y = lm_pred, col='red')
```

```r
sun_fit <- lm(SunshineTomorrow ~ Sunshine*Humidity3pm + Cloud3pm + Evaporation +
              I(Evaporation^2) + WindGustSpeed - 1, data = weather_data7)
summary(sun_fit)
```

```
Call:
lm(formula = SunshineTomorrow ~ Sunshine * Humidity3pm + Cloud3pm +
    Evaporation + I(Evaporation^2) + WindGustSpeed - 1, data = weather_data7)

Residuals:
    Min      1Q  Median      3Q     Max
-9.9701 -1.8230  0.6534  2.1907  7.0478

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
Sunshine              0.607984   0.098351   6.182 1.82e-09 ***
Humidity3pm           0.062289   0.012307   5.061 6.84e-07 ***
Cloud3pm             -0.178190   0.082520  -2.159 0.031522 *
Evaporation           0.738127   0.245356   3.008 0.002822 **
I(Evaporation^2)     -0.050735   0.020510  -2.474 0.013862 *
WindGustSpeed         0.036675   0.013624   2.692 0.007453 **
Sunshine:Humidity3pm -0.007704   0.002260  -3.409 0.000729 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.134 on 340 degrees of freedom
Multiple R-squared:  0.8718, Adjusted R-squared:  0.8692
F-statistic: 330.3 on 7 and 340 DF,  p-value: < 2.2e-16
```

All predictors are reported as significant, with a very good adjusted R-squared.

```r
lm_pred <- predict(sun_fit, weather_data7)
plot(x = seq_along(weather_data7$SunshineTomorrow), y = weather_data7$SunshineTomorrow,
     type='p', xlab = "observations", ylab = "SunshineTomorrow")
legend("topright", c("actual", "predicted"), fill = c("black", "red"))
points(x = seq_along(weather_data7$SunshineTomorrow), y = lm_pred, col='red')
```

The above plot makes it more evident that this model captures only a subset of the original Sunshine variance.

To give a more intuitive interpretation of the Sunshine information, it is advisable to report a factor variable having levels:

{"Cloudy", "Mostly Cloudy", "Partly Cloudy", "Mostly Sunny", "Sunny"}

In doing that, we furthermore have to take into account tomorrow's Cloud9am and Cloud3pm. For those quantitative variables, corresponding predictions are needed, and for this purpose the following linear regression models based on the Sunshine predictor can be defined.

```r
cloud9am_fit <- lm(Cloud9am ~ Sunshine, data = weather_data7)
summary(cloud9am_fit)
```

```
Call:
lm(formula = Cloud9am ~ Sunshine, data = weather_data7)

Residuals:
    Min      1Q  Median      3Q     Max
-5.2521 -1.7635 -0.4572  1.6920  5.9127

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.51527    0.28585   29.79   <2e-16 ***
Sunshine    -0.58031    0.03298  -17.59   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.163 on 345 degrees of freedom
Multiple R-squared:  0.4729, Adjusted R-squared:  0.4714
F-statistic: 309.6 on 1 and 345 DF,  p-value: < 2.2e-16
```

All predictors are reported as significant.

```r
lm_pred <- round(predict(cloud9am_fit, weather_data7))
lm_pred[lm_pred == 9] = 8
plot(x = weather_data7$Sunshine, y = weather_data7$Cloud9am, type='p',
     xlab = "Sunshine", ylab = "Cloud9am")
legend("topright", c("actual", "predicted"), fill = c("black", "red"))
points(x = weather_data7$Sunshine, y = lm_pred, col='red')
```

```r
cloud3pm_fit <- lm(Cloud3pm ~ Sunshine, data = weather_data7)
summary(cloud3pm_fit)
```

```
Call:
lm(formula = Cloud3pm ~ Sunshine, data = weather_data7)

Residuals:
    Min      1Q  Median      3Q     Max
-4.1895 -1.6230 -0.0543  1.4829  5.2883

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.01199    0.26382   30.37   <2e-16 ***
Sunshine    -0.50402    0.03044  -16.56   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.996 on 345 degrees of freedom
Multiple R-squared:  0.4428, Adjusted R-squared:  0.4412
F-statistic: 274.2 on 1 and 345 DF,  p-value: < 2.2e-16
```

All predictors are reported as significant.

```r
lm_pred <- round(predict(cloud3pm_fit, weather_data7))
lm_pred[lm_pred == 9] = 8
plot(x = weather_data7$Sunshine, y = weather_data7$Cloud3pm, type='p',
     xlab = "Sunshine", ylab = "Cloud3pm")
legend("topright", c("actual", "predicted"), fill = c("black", "red"))
points(x = weather_data7$Sunshine, y = lm_pred, col='red')
```

Once the SunshineTomorrow value is predicted, we can use it as predictor input in the Cloud9am/Cloud3pm linear regression models to determine both tomorrow's Cloud9am and Cloud3pm predictions. We have preferred this approach over building a model based on one of the following formulas:

```r
Cloud9amTomorrow ~ Cloud9am
Cloud9amTomorrow ~ Cloud9am + Sunshine
Cloud3pmTomorrow ~ Cloud3pm
```

because they yield a very low adjusted R-squared (approximately in the interval [0.07, 0.08]).

Based on the Cloud9am and Cloud3pm predictions, the computation of a new factor variable named CloudConditions, with levels {“Cloudy”, “Mostly Cloudy”, “Partly Cloudy”, “Mostly Sunny”, “Sunny”}, will be shown. As we observed before, our Sunshine prediction captures only a subset of the variance of the modeled variable; as a consequence, our cloud conditions prediction is expected to rarely (possibly never) report “Cloudy” and “Sunny”.

```r
minTemp_fit <- lm(MinTempTomorrow ~ MinTemp + Humidity3pm, data = weather_data7)
summary(minTemp_fit)
```

```
Call:
lm(formula = MinTempTomorrow ~ MinTemp + Humidity3pm, data = weather_data7)

Residuals:
     Min       1Q   Median       3Q      Max
-10.4820  -1.7546   0.3573   1.9119  10.0953

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.696230   0.520244   5.183 3.74e-07 ***
MinTemp      0.844813   0.027823  30.364  < 2e-16 ***
Humidity3pm -0.034404   0.009893  -3.478 0.000571 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.112 on 344 degrees of freedom
Multiple R-squared:  0.7327, Adjusted R-squared:  0.7311
F-statistic: 471.4 on 2 and 344 DF,  p-value: < 2.2e-16
```

All predictors are reported as significant.

```r
lm_pred <- predict(minTemp_fit, weather_data7)
plot(x = weather_data7$Sunshine, y = weather_data7$MinTemp, type='p',
     xlab = "Sunshine", ylab = "MinTemp")
legend("topright", c("actual", "fitted"), fill = c("black", "red"))
points(x = weather_data7$Sunshine, y = lm_pred, col='red')
```

```r
maxTemp_fit <- lm(MaxTempTomorrow ~ MaxTemp + Evaporation, data = weather_data7)
summary(maxTemp_fit)
```

```
Call:
lm(formula = MaxTempTomorrow ~ MaxTemp + Evaporation, data = weather_data7)

Residuals:
    Min      1Q  Median      3Q     Max
-13.217  -1.609   0.137   2.025   9.708

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.69811    0.56818   4.749 3.01e-06 ***
MaxTemp      0.82820    0.03545  23.363  < 2e-16 ***
Evaporation  0.20253    0.08917   2.271   0.0237 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.231 on 344 degrees of freedom
Multiple R-squared:  0.7731, Adjusted R-squared:  0.7718
F-statistic: 586 on 2 and 344 DF,  p-value: < 2.2e-16
```

All predictors are reported as significant.

```r
lm_pred <- predict(maxTemp_fit, weather_data7)
plot(x = weather_data7$Sunshine, y = weather_data7$MaxTemp, type='p',
     xlab = "Sunshine", ylab = "MaxTemp")
legend("topright", c("actual", "fitted"), fill = c("black", "red"))
points(x = weather_data7$Sunshine, y = lm_pred, col='red')
```

Based on the second reference given at the end, we have the following mapping between a descriptive cloud conditions string and the cloud coverage in “oktas”, a unit of eighths.

```
Cloudy:        8/8 opaque clouds
Mostly Cloudy: 6/8 - 7/8 opaque clouds
Partly Cloudy: 3/8 - 5/8 opaque clouds
Mostly Sunny:  1/8 - 2/8 opaque clouds
Sunny:         0/8 opaque clouds
```

We can define the following utility function to help.

```r
computeCloudConditions = function(cloud_9am, cloud_3pm) {
  cloud_avg = min(round((cloud_9am + cloud_3pm)/2), 8)
  cc_str = NULL
  if (cloud_avg == 8) {
    cc_str = "Cloudy"
  } else if (cloud_avg >= 6) {
    cc_str = "Mostly Cloudy"
  } else if (cloud_avg >= 3) {
    cc_str = "Partly Cloudy"
  } else if (cloud_avg >= 1) {
    cc_str = "Mostly Sunny"
  } else if (cloud_avg < 1) {
    cc_str = "Sunny"
  }
  cc_str
}
```

Starting from a basic example of a weather dataset, we were able to build several regression models. The first one, based on logistic regression, is capable of predicting the RainTomorrow factor variable. The linear regression models predict the Rainfall, Humidity3pm, WindGustSpeed, MinTemp, MaxTemp and CloudConditions weather metrics. The overall picture we got is the following.

```
                            +----------------+
                            |                +------> RainTomorrow
                            |                |
                            |    Weather     +------> Rainfall
                            |                |
{Today's weather data} ->---+    Forecast    +------> Humidity3pm
                            |                |
                            |    Models      +------> WindGustSpeed
                            |                |                        +-----------------------+
                            |                +------> Sunshine --->---+  linear regression    +-->-- {Cloud9am, Cloud3pm} --+
                            |                |                        |      models for       |                            |
                            |                +------> MinTemp         | Cloud9am and Cloud3pm |                            |
                            |                |                        +-----------------------+                            |
                            |                +------> MaxTemp                                                              |
                            +----------------+                        +-----------------------+                            |
                                                                      |    computing new      |                            |
{Cloudy, Mostly Cloudy, Partly Cloudy, Mostly Sunny, Sunny} <---------+   factor variable     +------------<---------------+
                                                                      |   CloudConditions     |
                                                                      +-----------------------+
```

As said before, if RainTomorrow == “Yes”, Rainfall is explicitly given; otherwise an upper bound, Rainfall < 1 mm, is reported. The chance of rain is computed only if the RainTomorrow prediction is “Yes”. The Humidity3pm prediction is taken as the humidity prediction for the whole day, in general.

```r
weather_report <- function(today_record, rain_tomorrow_model, cutoff) {
  # RainTomorrow prediction
  rainTomorrow_prob <- predict(rain_tomorrow_model, today_record, type="prob")
  rainTomorrow_pred = ifelse(rainTomorrow_prob$Yes >= cutoff, "Yes", "No")

  # Rainfall prediction iff RainTomorrow prediction is Yes; chance of rain probability
  rainfall_pred <- NA
  chance_of_rain <- NA
  if (rainTomorrow_pred == "Yes") {
    rainfall_pred <- round(predict(rf_fit, today_record), 1)
    chance_of_rain <- round(rainTomorrow_prob$Yes*100)
  }

  # WindGustSpeed prediction
  wgs_pred <- round(predict(wgs_fit, today_record), 1)

  # Humidity3pm prediction
  h3pm_pred <- round(predict(h3pm_fit, today_record), 1)

  # sunshine prediction is used to fit Cloud9am and Cloud3pm
  sun_pred <- predict(sun_fit, today_record)
  cloud9am_pred <- min(round(predict(cloud9am_fit, data.frame(Sunshine=sun_pred))), 8)
  cloud3pm_pred <- min(round(predict(cloud3pm_fit, data.frame(Sunshine=sun_pred))), 8)

  # a descriptive cloud conditions string is computed
  CloudConditions_pred <- computeCloudConditions(cloud9am_pred, cloud3pm_pred)

  # MinTemp prediction
  minTemp_pred <- round(predict(minTemp_fit, today_record), 1)

  # MaxTemp prediction
  maxTemp_pred <- round(predict(maxTemp_fit, today_record), 1)

  # converting all numeric predictions to strings
  if (is.na(rainfall_pred)) {
    rainfall_pred_str <- "< 1 mm"
  } else {
    rainfall_pred_str <- paste(rainfall_pred, "mm", sep = " ")
  }
  if (is.na(chance_of_rain)) {
    chance_of_rain_str <- ""
  } else {
    chance_of_rain_str <- paste(chance_of_rain, "%", sep="")
  }
  wgs_pred_str <- paste(wgs_pred, "Km/h", sep = " ")
  h3pm_pred_str <- paste(h3pm_pred, "%", sep = "")
  minTemp_pred_str <- paste(minTemp_pred, "°C", sep = "")
  maxTemp_pred_str <- paste(maxTemp_pred, "°C", sep = "")

  report <- data.frame(Rainfall = rainfall_pred_str,
                       ChanceOfRain = chance_of_rain_str,
                       WindGustSpeed = wgs_pred_str,
                       Humidity = h3pm_pred_str,
                       CloudConditions = CloudConditions_pred,
                       MinTemp = minTemp_pred_str,
                       MaxTemp = maxTemp_pred_str)
  report
}
```

Of course, there are confidence and prediction intervals associated with our predictions. However, since we intend to report our forecasts to Canberra's citizens, our message should be put in simple words to reach everybody, not just statisticians.

Finally, we can try out our tomorrow weather forecast report.

```r
(tomorrow_report <- weather_report(weather_data[10,], mod_ev_c3_fit, 0.56))
```

```
  Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp
1   < 1 mm                  36.9 Km/h    39.7%   Partly Cloudy   8.7°C  22.7°C
```

```r
(tomorrow_report <- weather_report(weather_data[32,], mod_ev_c3_fit, 0.56))
```

```
  Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp
1   < 1 mm                    48 Km/h    41.3%   Partly Cloudy  10.9°C  24.8°C
```

```r
(tomorrow_report <- weather_report(weather_data[50,], mod_ev_c3_fit, 0.56))
```

```
  Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp
1   8.3 mm          59%     36.9 Km/h    60.1%   Partly Cloudy  10.8°C  22.2°C
```

```r
(tomorrow_report <- weather_report(weather_data[100,], mod_ev_c3_fit, 0.56))
```

```
  Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp
1   5.4 mm          71%     46.7 Km/h    51.3%   Partly Cloudy  11.2°C  20.3°C
```

```r
(tomorrow_report <- weather_report(weather_data[115,], mod_ev_c3_fit, 0.56))
```

```
  Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp
1   < 1 mm                  46.4 Km/h    35.7%    Mostly Sunny  12.3°C  25.1°C
```

```r
(tomorrow_report <- weather_report(weather_data[253,], mod_ev_c3_fit, 0.56))
```

```
  Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp
1   9.4 mm          81%     52.4 Km/h    65.2%   Partly Cloudy   1.3°C  10.3°C
```

```r
(tomorrow_report <- weather_report(weather_data[311,], mod_ev_c3_fit, 0.56))
```

```
  Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp
1   < 1 mm                  40.7 Km/h    44.2%   Mostly Cloudy   6.9°C  16.4°C
```

Please note that some records of the original weather dataset may contain NA values, and our regression models do not handle missing predictors. To definitively evaluate the accuracy of our weather forecast report, we would need to check its predictions on unseen data against the weather that actually occurred the next day. Furthermore, improving the adjusted R-squared metric of our linear regression models is a potential area of investigation.

Starting from a basic weather dataset, we went through an interesting data story involving exploratory analysis and model building. We spotted strengths and weaknesses of our prediction models. We also picked up some meteorological terminology along the way.

We hope you enjoyed this tutorial series on weather forecast. If you have any questions, please feel free to comment below.

**References**


It took me some time to figure out the full process from A to Z. I was often left with remaining questions:

- What to do with the import of external libraries?
- What about documentation?
- How to actually install my package?
- How can I share it with others?

Therefore, I will explain how I made my first R package and which methods I found helpful. Of course, there are different approaches out there, some of them very well documented, like for example this one – but the following *3 easy steps* worked for me, so maybe they will help you, too, to get your first R package ready, installed and online quickly.

My first package is a wrapper for the FlightStats API. You can sign up for FlightStats to get a free trial API key (which unfortunately works for one month only). However, you can of course make a wrapper for any API you like or make a non-API related package. Two links I found very useful for getting started making my first package are this one and this one (regarding the storing of API keys). Also, I learned a lot using this package as an example.

First of all, we need to have all the functions (and possibly dataframes & other objects) ready in our R environment. For example, imagine we want to include these two functions to store the API key and an app ID:

setAPIKey <- function(){
  input = readline(prompt="Enter your FlightStats API Key: ")
  # A simple way of storing API keys: Sys.setenv() only sets the variable for
  # the current session, so the login details will have to be provided again
  # next session. See below how to store login details in a more durable way.
  Sys.setenv(flightstats_api_key = input)
}

setAppId <- function(){
  input = readline(prompt="Enter your FlightStats appID: ")
  Sys.setenv(flightstats_app_id = input)
}

Now you can retrieve the login in details like this:

# if the key does not exist, this returns an empty string (""); in that case
# the user should be prompted to use the setAPIKey() and setAppId() functions
ID = Sys.getenv("flightstats_app_id")
KEY = Sys.getenv("flightstats_api_key")

However, if you would like the login details to still be accessible the next time you open R, you can define your login details in the *.Renviron* file (instead of the *.Rprofile* file). The *.Renviron* file can be found in your home directory. You can find the path to your home directory by running `Sys.getenv("HOME")`. For more info see here. The final function to store the API key in the FlightsR package looked like this.
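The FlightsR function itself is linked above; purely as an illustrative sketch of the .Renviron approach (a hypothetical helper of my own, with the file path made a parameter), it could look like this:

```r
# Hypothetical helper: persist the API key by appending a line to .Renviron,
# so Sys.getenv("flightstats_api_key") finds it in future sessions
setAPIKeyPermanent <- function(key,
                               path = file.path(Sys.getenv("HOME"), ".Renviron")) {
  cat(sprintf("flightstats_api_key=%s\n", key), file = path, append = TRUE)
}
```

After restarting R, `Sys.getenv("flightstats_api_key")` would then return the stored key.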

Side note: I actually did not know this before, but find it quite handy: if you want to see the source code of any function in R, just type the function name and hit enter. For example:

> read.csv
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
    fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
    dec = dec, fill = fill, comment.char = comment.char, ...)
<bytecode: 0x029e046c>
<environment: namespace:utils>

Let’s make another function that we can add to the package. For example, here is a simple function to list all airlines:

listAirlines <- function(activeOnly=TRUE){
  ID = Sys.getenv("flightstats_app_id")
  KEY = Sys.getenv("flightstats_api_key")
  if (ID == "" | KEY == ""){
    stop("Please set your FlightStats AppID and API Key with the setAPIKey() and setAppId() functions. You can obtain these from https://developer.flightstats.com.")
  }
  if (activeOnly == FALSE) {
    choice = "all"
  } else {
    choice = "active"
  }
  link = paste0("https://api.flightstats.com/flex/airlines/rest/v1/json/", choice, "?appId=", ID, "&appKey=", KEY)
  dat = getURL(link)
  dat_list <- fromJSON(dat)
  airlines <- dat_list$airlines
  return(airlines)
}

Use `package.skeleton()`, devtools & roxygen2 to let R prepare the necessary documents for you:

package.skeleton(name = "FlightR",
                 list = c("listAirlines", "listAirports", "scheduledFlights",
                          "scheduledFlightsFullDay", "searchAirline",
                          "searchAirport", "setAPIKey", "setAppId"))

That’s it! Now in your working directory folder there should be a new folder with the name you just gave to your package.

Now, what is handy about the function above is that it creates the folders and files you need in a new package folder (“FlightsR” in this case). In the `/R` folder you see now that every function you added has its own .R script, and in the `/man` folder there is an .Rd file for each of the functions.

You can now go and manually change everything in these files that needs to be changed (documentation needs to be added, the import of external packages to be defined, etc.) – or use roxygen2 and devtools to do it for you. Roxygen2 will complete the documentation in each .Rd file correctly and will create a *NAMESPACE* file for you. To do this, make sure you *delete the current incomplete files* (that is, all the files in the `/man` folder and the *NAMESPACE* file), otherwise you will get an error when you use the `document()` function later.

Now extra information needs to be added in the functions (for example, what are the *parameters* of the function, an *example* usage, necessary *imports*, etc.), in the following way:

#' Function searches a specific airline by IATA code
#'
#' @param value character, an airline IATA code
#' @return data.frame() with the airline
#'
#' @author Emelie Hofland, \email{emelie_hofland@hotmail.com}
#'
#' @examples
#' searchAirline("FR")
#'
#' @import RCurl
#' @import jsonlite
#' @export
#'
searchAirline <- function(value){
  ID = Sys.getenv("flightstats_app_id")
  KEY = Sys.getenv("flightstats_api_key")
  if (ID == "" | KEY == ""){
    stop("Please set your FlightStats AppID and API Key with the setAPIKey() and setAppId() functions. You can obtain these from https://developer.flightstats.com.")
  }
  link = paste0("https://api.flightstats.com/flex/airlines/rest/v1/json/iata/", toupper(value), "?appId=", ID, "&appKey=", KEY)
  dat <- getURL(link)
  dat_list <- fromJSON(dat)
  result <- dat_list$airlines
  if (length(result) == 0){
    warning("Please make sure that you provide a valid airline IATA code.")
  }
  return(result)
}

Do not forget to add the `@export` tag, otherwise your functions will not be there when you load your package!

Now, when you have added this information for all your functions, make sure that devtools and roxygen2 are installed.

install.packages("roxygen2")
install.packages("devtools")

Make sure your working directory is set to the folder of your package and run the following commands in R:

library(devtools)

# to automatically generate the documentation:
document()

# to build the package:
build()

# to install the package:
install()

*Voila!* You are done. I do not know if it is necessary, but just to be sure I restarted R at this point. In a new session you can now run `library(YourPackageName)` and this should work.

To adjust functions, you can just change things in the functions in the package and re-run the `document()`, `build()` and `install()` commands.

Note: These steps assume that you have Git installed and configured on your PC.

1) Create a new repository in your github account.

2) Create and copy the https link to clone the repo on your PC.

3) Go to the folder on your PC where you want to save your repo, open the command line interface & type:

$ git clone https://github.com/YourGithub/YourPackage.git

4) Copy all the files from your package in the folder and run:

$ git add .
$ git commit -m "whatever message you want to add"
$ git push origin master

5) *Voila* – now your package should be on GitHub!

Now people can download & install your package straight from GitHub or GitLab – the devtools package has a function for this:

if (!require(devtools)) { install.packages('devtools') }

# If your repo is on GitHub:
devtools::install_github('YourGithub/YourPackage')

# If your repo is a public repo on GitLab:
devtools::install_git("https://gitlab/link/to/your/repo")

# If your repo is a private repo on GitLab:
devtools::install_git("https://emelie.hofland:password@gitlab/link/to/your/repo.git")

Post a comment below if you have a question.


By default, caret uses 0.5 as the decision threshold, i.e. the probability cutoff value. We are going to show how to tune this decision threshold to achieve a better training set accuracy, and then compare how the testing set accuracy changes accordingly.
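To illustrate the idea before touching the real models, here is a tiny base-R sketch (toy probabilities and labels of my own, not from the weather dataset) showing how the predicted classes, and hence the accuracy, change as the cutoff moves:

```r
# Toy predicted probabilities for class "Yes" and the corresponding true labels
probs <- c(0.30, 0.45, 0.55, 0.70)
truth <- c("No", "Yes", "No", "Yes")

# The same probabilities yield different class labels, hence different
# accuracies, at different cutoff values
for (q in c(0.4, 0.5, 0.6)) {
  pred <- ifelse(probs >= q, "Yes", "No")
  cat(sprintf("cutoff %.1f -> accuracy %.2f\n", q, mean(pred == truth)))
}
```

Here cutoffs 0.4 and 0.6 both reach 0.75 accuracy while 0.5 only reaches 0.50, which is exactly the kind of effect the tuning procedure below searches for.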

We load previously saved training, testing datasets together with the models we have already determined.

suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(ROCR))

set.seed(1023)
readRDS(file="wf_log_reg_part2.rds")

The following tuning procedure explores a pool of cutoff values and returns a table of {cutoff, accuracy, specificity} triplets. Inspecting it allows us to identify the cutoff value providing the highest accuracy. In case of a tie in accuracy, we choose the lowest such cutoff value.

glm.tune <- function(glm_model, dataset) {
  results <- data.frame()
  for (q in seq(0.2, 0.8, by = 0.02)) {
    fitted_values <- glm_model$finalModel$fitted.values
    prediction <- ifelse(fitted_values >= q, "Yes", "No")
    cm <- confusionMatrix(prediction, dataset$RainTomorrow)
    accuracy <- cm$overall["Accuracy"]
    specificity <- cm$byClass["Specificity"]
    results <- rbind(results, data.frame(cutoff = q, accuracy = accuracy, specificity = specificity))
  }
  rownames(results) <- NULL
  results
}

In the utility function above, we also included the specificity metric. Specificity is defined as TN/(TN+FP), and we will discuss some aspects of it in the next part of this series. Let us show an example confusion matrix with some metric computations to clarify the terminology.

          Reference
Prediction  No Yes
       No  203  42
       Yes   0   3

       Accuracy : 0.8306  = (TP+TN)/(TP+TN+FP+FN) = (203+3)/(203+3+42+0)
    Sensitivity : 1.00000 = TP/(TP+FN) = 203/(203+0)
    Specificity : 0.06667 = TN/(TN+FP) = 3/(3+42)
 Pos Pred Value : 0.82857 = TP/(TP+FP) = 203/(203+42)
 Neg Pred Value : 1.00000 = TN/(TN+FN) = 3/(3+0)

Please note that the “No” class is associated with a positive outcome, whilst “Yes” with a negative one. Specificity therefore measures how good we are at predicting {RainTomorrow = “Yes”} when it actually rains the next day, and this is something that we are going to investigate in the next post of this series.
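As a base-R cross-check of the terminology (variable names here are mine), the metrics in the example above can be recomputed directly from the confusion matrix counts:

```r
# Confusion matrix from the example above, with "No" as the positive class
cm <- matrix(c(203, 0, 42, 3), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))
TP <- cm["No", "No"];  FN <- cm["Yes", "No"]
FP <- cm["No", "Yes"]; TN <- cm["Yes", "Yes"]

accuracy    <- (TP + TN) / sum(cm)  # (203+3)/248 = 0.8306
sensitivity <- TP / (TP + FN)       # 203/203    = 1
specificity <- TN / (TN + FP)       # 3/45       = 0.0667
```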

The logistic regression models to be tuned are:

mod9am_c1_fit: RainTomorrow ~ Cloud9am + Humidity9am + Pressure9am + Temp9am
mod3pm_c1_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm
mod_ev_c2_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDir
mod_ev_c3_fit: RainTomorrow ~ Pressure3pm + Sunshine

In the following analysis, we determine for each model the cutoff value that provides the highest accuracy.

glm.tune(mod9am_c1_fit, training)

   cutoff  accuracy specificity
1    0.20 0.7822581  0.73333333
2    0.22 0.7862903  0.71111111
3    0.24 0.7943548  0.68888889
4    0.26 0.8064516  0.64444444
5    0.28 0.8104839  0.64444444
6    0.30 0.8145161  0.60000000
7    0.32 0.8185484  0.57777778
8    0.34 0.8185484  0.51111111
9    0.36 0.8104839  0.46666667
10   0.38 0.8266129  0.46666667
11   0.40 0.8266129  0.40000000
12   0.42 0.8306452  0.40000000
13   0.44 0.8266129  0.37777778
14   0.46 0.8346774  0.35555556
15   0.48 0.8467742  0.33333333
16   0.50 0.8508065  0.31111111
17   0.52 0.8427419  0.26666667
18   0.54 0.8467742  0.26666667
19   0.56 0.8346774  0.20000000
20   0.58 0.8387097  0.20000000
21   0.60 0.8387097  0.20000000
22   0.62 0.8346774  0.17777778
23   0.64 0.8387097  0.17777778
24   0.66 0.8346774  0.15555556
25   0.68 0.8387097  0.15555556
26   0.70 0.8387097  0.13333333
27   0.72 0.8387097  0.13333333
28   0.74 0.8346774  0.11111111
29   0.76 0.8266129  0.06666667
30   0.78 0.8266129  0.06666667
31   0.80 0.8306452  0.06666667

The optimal cutoff value is equal to 0.5. In this case, we find the same cutoff value as the default that the caret package uses for logistic regression models.

opt_cutoff <- 0.5
pred_test <- predict(mod9am_c1_fit, testing, type = "prob")
prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No")
confusionMatrix(prediction, testing$RainTomorrow)

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  80  12
       Yes  6   7

               Accuracy : 0.8286
                 95% CI : (0.7427, 0.8951)
    No Information Rate : 0.819
    P-Value [Acc > NIR] : 0.4602
                  Kappa : 0.3405
 Mcnemar's Test P-Value : 0.2386
            Sensitivity : 0.9302
            Specificity : 0.3684
         Pos Pred Value : 0.8696
         Neg Pred Value : 0.5385
             Prevalence : 0.8190
         Detection Rate : 0.7619
   Detection Prevalence : 0.8762
      Balanced Accuracy : 0.6493
       'Positive' Class : No

Then, tuning the 3PM model.

glm.tune(mod3pm_c1_fit, training)

   cutoff  accuracy specificity
1    0.20 0.8064516   0.7777778
2    0.22 0.8185484   0.7555556
3    0.24 0.8225806   0.7333333
4    0.26 0.8306452   0.6888889
5    0.28 0.8467742   0.6666667
6    0.30 0.8467742   0.6444444
7    0.32 0.8427419   0.6222222
8    0.34 0.8669355   0.6222222
9    0.36 0.8709677   0.6222222
10   0.38 0.8629032   0.5777778
11   0.40 0.8669355   0.5777778
12   0.42 0.8669355   0.5555556
13   0.44 0.8548387   0.4666667
14   0.46 0.8548387   0.4444444
15   0.48 0.8588710   0.4444444
16   0.50 0.8669355   0.4444444
17   0.52 0.8629032   0.4222222
18   0.54 0.8669355   0.4222222
19   0.56 0.8669355   0.3777778
20   0.58 0.8669355   0.3777778
21   0.60 0.8588710   0.3333333
22   0.62 0.8548387   0.3111111
23   0.64 0.8508065   0.2888889
24   0.66 0.8467742   0.2666667
25   0.68 0.8387097   0.2222222
26   0.70 0.8387097   0.2222222
27   0.72 0.8346774   0.2000000
28   0.74 0.8387097   0.1777778
29   0.76 0.8346774   0.1555556
30   0.78 0.8346774   0.1555556
31   0.80 0.8306452   0.1111111

The optimal cutoff value is equal to 0.36.

opt_cutoff <- 0.36
pred_test <- predict(mod3pm_c1_fit, testing, type="prob")
prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No")
confusionMatrix(prediction, testing$RainTomorrow)

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  81   5
       Yes  5  14

               Accuracy : 0.9048
                 95% CI : (0.8318, 0.9534)
    No Information Rate : 0.819
    P-Value [Acc > NIR] : 0.0112
                  Kappa : 0.6787
 Mcnemar's Test P-Value : 1.0000
            Sensitivity : 0.9419
            Specificity : 0.7368
         Pos Pred Value : 0.9419
         Neg Pred Value : 0.7368
             Prevalence : 0.8190
         Detection Rate : 0.7714
   Detection Prevalence : 0.8190
      Balanced Accuracy : 0.8394
       'Positive' Class : No

We improved the test accuracy, previously equal to 0.8857 and now equal to 0.9048 (please take a look at part #2 of this tutorial for the test accuracy with the 0.5 cutoff value). Next, the first evening hours model.

glm.tune(mod_ev_c2_fit, training)

   cutoff  accuracy specificity
1    0.20 0.8387097   0.8444444
2    0.22 0.8467742   0.8222222
3    0.24 0.8548387   0.8222222
4    0.26 0.8629032   0.8000000
5    0.28 0.8588710   0.7333333
6    0.30 0.8709677   0.7333333
7    0.32 0.8790323   0.7111111
8    0.34 0.8830645   0.6888889
9    0.36 0.8870968   0.6666667
10   0.38 0.8870968   0.6666667
11   0.40 0.8951613   0.6666667
12   0.42 0.8991935   0.6666667
13   0.44 0.8991935   0.6666667
14   0.46 0.8951613   0.6444444
15   0.48 0.8951613   0.6222222
16   0.50 0.8951613   0.6222222
17   0.52 0.8951613   0.6000000
18   0.54 0.8911290   0.5777778
19   0.56 0.8911290   0.5777778
20   0.58 0.8951613   0.5777778
21   0.60 0.8911290   0.5555556
22   0.62 0.8870968   0.5111111
23   0.64 0.8830645   0.4888889
24   0.66 0.8830645   0.4666667
25   0.68 0.8790323   0.4222222
26   0.70 0.8709677   0.3555556
27   0.72 0.8548387   0.2666667
28   0.74 0.8508065   0.2444444
29   0.76 0.8427419   0.1777778
30   0.78 0.8387097   0.1555556
31   0.80 0.8387097   0.1555556

The optimal cutoff value is equal to 0.42.

opt_cutoff <- 0.42
pred_test <- predict(mod_ev_c2_fit, testing, type="prob")
prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No")
confusionMatrix(prediction, testing$RainTomorrow)

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  82   8
       Yes  4  11

               Accuracy : 0.8857
                 95% CI : (0.8089, 0.9395)
    No Information Rate : 0.819
    P-Value [Acc > NIR] : 0.04408
                  Kappa : 0.58
 Mcnemar's Test P-Value : 0.38648
            Sensitivity : 0.9535
            Specificity : 0.5789
         Pos Pred Value : 0.9111
         Neg Pred Value : 0.7333
             Prevalence : 0.8190
         Detection Rate : 0.7810
   Detection Prevalence : 0.8571
      Balanced Accuracy : 0.7662
       'Positive' Class : No

We did not improve the accuracy here, as the default cutoff value already provides the same. Other metrics changed; in particular, the sensitivity decreased a little while the positive predicted value improved. Then the second evening hours model.

glm.tune(mod_ev_c3_fit, training)

   cutoff  accuracy specificity
1    0.20 0.8225806   0.8000000
2    0.22 0.8145161   0.7333333
3    0.24 0.8064516   0.6666667
4    0.26 0.8145161   0.6666667
5    0.28 0.8145161   0.6222222
6    0.30 0.8185484   0.6222222
7    0.32 0.8266129   0.6000000
8    0.34 0.8185484   0.5555556
9    0.36 0.8306452   0.5333333
10   0.38 0.8467742   0.5333333
11   0.40 0.8467742   0.5111111
12   0.42 0.8427419   0.4666667
13   0.44 0.8508065   0.4666667
14   0.46 0.8467742   0.4222222
15   0.48 0.8467742   0.4000000
16   0.50 0.8508065   0.4000000
17   0.52 0.8548387   0.4000000
18   0.54 0.8508065   0.3777778
19   0.56 0.8669355   0.3777778
20   0.58 0.8669355   0.3555556
21   0.60 0.8629032   0.3333333
22   0.62 0.8588710   0.3111111
23   0.64 0.8548387   0.2888889
24   0.66 0.8508065   0.2666667
25   0.68 0.8467742   0.2444444
26   0.70 0.8387097   0.2000000
27   0.72 0.8387097   0.1777778
28   0.74 0.8346774   0.1555556
29   0.76 0.8306452   0.1333333
30   0.78 0.8306452   0.1333333
31   0.80 0.8306452   0.1333333

The optimal cutoff value is equal to 0.56.

opt_cutoff <- 0.56
pred_test <- predict(mod_ev_c3_fit, testing, type="prob")
prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No")
confusionMatrix(prediction, testing$RainTomorrow)

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  82   8
       Yes  4  11

               Accuracy : 0.8857
                 95% CI : (0.8089, 0.9395)
    No Information Rate : 0.819
    P-Value [Acc > NIR] : 0.04408
                  Kappa : 0.58
 Mcnemar's Test P-Value : 0.38648
            Sensitivity : 0.9535
            Specificity : 0.5789
         Pos Pred Value : 0.9111
         Neg Pred Value : 0.7333
             Prevalence : 0.8190
         Detection Rate : 0.7810
   Detection Prevalence : 0.8571
      Balanced Accuracy : 0.7662
       'Positive' Class : No

In this case, we did not improve the accuracy of this model either. Other metrics changed; in particular, the sensitivity is slightly higher than the default-cutoff value (0.9419), while the positive predicted value decreased a little (it was 0.9205 for the default cutoff).

To gain a visual understanding of the classification performance and compute further metrics, we take advantage of the ROCR package. For each model, we plot the ROC curve and determine the corresponding AUC value. The function used to accomplish this task is the following.

glm.perf.plot <- function (prediction, cutoff) {
  perf <- performance(prediction, measure = "tpr", x.measure = "fpr")
  par(mfrow = (c(1, 2)))
  plot(perf, col = "red")
  grid()

  perf <- performance(prediction, measure = "acc", x.measure = "cutoff")
  plot(perf, col = "red")
  abline(v = cutoff, col = "green")
  grid()

  auc_res <- performance(prediction, "auc")
  auc_res@y.values[[1]]
}

In the following, each model will be considered, AUC value printed out and ROC plots given.

mod9am_pred_prob <- predict(mod9am_c1_fit, testing, type="prob")
mod9am_pred_resp <- prediction(mod9am_pred_prob$Yes, testing$RainTomorrow)
glm.perf.plot(mod9am_pred_resp, 0.5)

[1] 0.8004896

mod3pm_pred_prob <- predict(mod3pm_c1_fit, testing, type="prob")
mod3pm_pred_resp <- prediction(mod3pm_pred_prob$Yes, testing$RainTomorrow)
glm.perf.plot(mod3pm_pred_resp, 0.36)

[1] 0.9155447

The 3PM model shows the highest AUC value among all the models.

mod_ev_c2_prob <- predict(mod_ev_c2_fit, testing, type="prob")
mod_ev_c2_pred_resp <- prediction(mod_ev_c2_prob$Yes, testing$RainTomorrow)
glm.perf.plot(mod_ev_c2_pred_resp, 0.42)

[1] 0.8390453

mod_ev_c3_prob <- predict(mod_ev_c3_fit, testing, type="prob")
mod_ev_c3_pred_resp <- prediction(mod_ev_c3_prob$Yes, testing$RainTomorrow)
glm.perf.plot(mod_ev_c3_pred_resp, 0.56)

[1] 0.8886169
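As a sanity check on the AUC values reported above, the area under the ROC curve can also be computed in base R with the rank-sum (Wilcoxon) formula; `auc_manual` below is my own illustrative helper, not part of ROCR, shown here with toy scores:

```r
# AUC = probability that a randomly chosen positive case receives a higher
# score than a randomly chosen negative case (rank-sum formulation)
auc_manual <- function(scores, labels) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_manual(c(0.9, 0.8, 0.3, 0.2), c(1, 1, 0, 0))  # perfectly separated: 1
```

Applied to a model's predicted probabilities and the true labels, this agrees with ROCR's `performance(pred, "auc")`.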

By tuning the decision threshold, we were able to improve training and testing set accuracy. It turned out the 3PM model achieved a satisfactory accuracy. The ROC plots gave us an understanding of how the true/false positive rates vary and what accuracy a specific decision threshold (i.e. cutoff value) obtains. Finally, the AUC values were reported.

If you have any questions, please feel free to comment below.


Last January, Tensorflow for R was released, which provided access to the Tensorflow API from R. However, for most R users, the interface was not very R-like.

Take a look at this code chunk for training a model:

cross_entropy <- tf$reduce_mean(-tf$reduce_sum(y_ * tf$log(y_conv), reduction_indices=1L))
train_step <- tf$train$AdamOptimizer(1e-4)$minimize(cross_entropy)
correct_prediction <- tf$equal(tf$argmax(y_conv, 1L), tf$argmax(y_, 1L))
accuracy <- tf$reduce_mean(tf$cast(correct_prediction, tf$float32))
sess$run(tf$global_variables_initializer())

for (i in 1:20000) {
  batch <- mnist$train$next_batch(50L)
  if (i %% 100 == 0) {
    train_accuracy <- accuracy$eval(feed_dict = dict(
      x = batch[[1]], y_ = batch[[2]], keep_prob = 1.0))
    cat(sprintf("step %d, training accuracy %g\n", i, train_accuracy))
  }
  train_step$run(feed_dict = dict(
    x = batch[[1]], y_ = batch[[2]], keep_prob = 0.5))
}

test_accuracy % fit(x = train_x, y = train_y, epochs=epochs, batch_size=batch_size, validation_data=valid)

So if you are still with me, let me show you how to build deep learning models using R, Keras, and Tensorflow together. You will find a Github repo that contains the code and data you will need. Included is an R notebook that walks through building an image classifier (telling cats from dogs), but it can easily be generalized to other images. The walkthrough includes advanced methods that are commonly used for production deep learning work, including:

- augmenting data
- using the bottleneck features of a pre-trained network
- fine-tuning the top layers of a pre-trained network
- saving weights for your models

The R interface to Keras truly makes it easy to build deep learning models in R. Here are some code snippets to illustrate how intuitive and useful Keras for R is:

To load picture from a folder:

train_generator <- flow_images_from_directory(
  train_directory,
  generator = image_data_generator(),
  target_size = c(img_width, img_height),
  color_mode = "rgb",
  class_mode = "binary",
  batch_size = batch_size,
  shuffle = TRUE,
  seed = 123)

To define a simple convolutional neural network:

model <- keras_model_sequential()

model %>%
  layer_conv_2d(filter = 32, kernel_size = c(3,3), input_shape = c(img_width, img_height, 3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_conv_2d(filter = 32, kernel_size = c(3,3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_conv_2d(filter = 64, kernel_size = c(3,3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_flatten() %>%
  layer_dense(64) %>%
  layer_activation("relu") %>%
  layer_dropout(0.5) %>%
  layer_dense(1) %>%
  layer_activation("sigmoid")

To augment data:

augment <- image_data_generator(
  rescale = 1/255,
  shear_range = 0.2,
  zoom_range = 0.2,
  horizontal_flip = TRUE)

To load a pretrained network:

model_vgg <- application_vgg16(include_top = FALSE, weights = "imagenet")

To save model weights:

save_model_weights_hdf5(model_ft, 'finetuning_30epochs_vggR.h5', overwrite = TRUE)

I believe the Keras for R interface will make it much easier for the R community to build and refine deep learning models in R. This means you don't have to force everyone to use Python to build, refine, and test your models. I really think this will open up deep learning to a wider audience that was a bit apprehensive about using Python. So for now, give it a spin!

Grab my github repo, fire up RStudio (or your IDE of choice), and go build a simple classifier using Keras.
