Artisanal Meanderings

VSPEX Blue Fundamentals

Heeki Park — Wed, 18 Mar 2015 15:56:58 +0000

Basic Definitions
Converged Infrastructure (CI) – delivery and consumption of compute, storage, and connectivity as a complete system
Platform 2 applications – traditional, legacy applications
Platform 3 applications – cloud or scale out applications

Cloud Landscape
1st architecture: Integrated Reference Architecture (IRA) – EMC VSPEX offerings
2nd architecture: Integrated Infrastructure System (IIS) – VCE Vblock
3rd architecture: Hyper-Converged Infrastructure (HCI) – Nutanix, Simplivity, VSPEX Blue
4th architecture: Rackscale/Hyper-Rackscale (RS/HRS) – consumption of a complete rack of converged infrastructure via resource pools
5th architecture: Common Modular Building Blocks – hybrid of IIS and HCI so that storage can scale in a non-linear fashion

EMC CI in 2015
Vblock and VSPEX for Platform 2
VSPEX Blue for ROBO + VDI + SMB
Product pending announcement at EMC World based on the 5th architecture

Hyper-converged Infrastructure Appliance (HCIA)
Integration of compute, storage, and virtualization into a commodity off-the-shelf architecture
IDC expects significant growth in this market due to changing consumption models by consumers

VSPEX Blue
Claims power on to VM creation in 15 minutes
EMC’s value claim – 15 minutes from power on to VM deployment, simple management, linear scalability, single point of support
Powered by VMware EVO:RAIL
Four configurations – 2 standard, 2 performance
Software integrations
RecoverPoint for VMs (for replication)
VMware vSphere Data Protection Advanced (for backup; Avamar and Data Domain under the covers)
EMC CloudArray gateway (for access to cloud storage)
ESRS (for integration with EMC support)
Scales from one to four 2U/4-node appliances

Competition
Nutanix – current market leader
Simplivity
Both have strong partner networks in most geographies

Component 1: EMC VSPEX Blue Hardware
Per standard node specifications:
Processor – Dual Intel Ivy Bridge E5-2620 v2 (12 cores total, 2.1GHz)
Memory – 128GB
Storage – 32GB SLC SATADOM, 400GB eMLC 2.5″, 3x 1.2TB 10k HDD
Network – 2x 10GbE SFP+ or 10GBase-T RJ-45, 1GbE (RMM)

Per performance node specifications (changes only):
Memory – 192GB

4 appliances – 16 server nodes
3 appliances – 12 server nodes
2 appliances – 8 server nodes
1 appliance – 4 server nodes

Each appliance can support about 100 virtual machines or 250 Horizon View virtual desktops
Appliances are deployed whole (server nodes will always be a multiple of 4)
Each appliance requires 8x 10GbE network ports
All 4 nodes in the appliance must be connected to the 10GbE switch
2x 10GbE ports per node, hence 8x 10GbE for all 4 nodes
Configuring a 10GbE top-of-rack switch is recommended
Enable multicast, IGMP snooping, and IGMP querier
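
The scaling rules above reduce to simple multiplication. Below is a minimal sketch in shell, assuming only the per-appliance figures quoted in these notes (4 nodes, 8x 10GbE ports, roughly 100 virtual machines or 250 Horizon View desktops). The function name `vspex_blue_sizing` is made up for illustration and is not part of any EMC tooling.

```shell
#!/bin/bash
# Hypothetical helper: derive node, port, and rough workload counts
# from the number of VSPEX Blue appliances (1 to 4), using the
# per-appliance figures quoted in the notes above.
vspex_blue_sizing() {
  local appliances=$1
  if [ "$appliances" -lt 1 ] || [ "$appliances" -gt 4 ]; then
    echo "appliances must be between 1 and 4" >&2
    return 1
  fi
  echo "nodes=$((appliances * 4))"
  echo "ports_10gbe=$((appliances * 8))"
  echo "approx_vms=$((appliances * 100))"
  echo "approx_view_desktops=$((appliances * 250))"
}

vspex_blue_sizing 3
```

The workload numbers are the rough planning figures from the notes, so treat the output as a starting point rather than a sizing guarantee.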

Component 2: VMware EVO:RAIL Software
Browser-agnostic HTML5 GUI
Configures vCenter Server
Configures ESXi hosts and VSAN for all nodes
Configures the network per user input
Default provisioning policies by size of virtual machines and security
Link to the vCenter Web Client is provided in the EVO:RAIL GUI as well
Automatically discovers each appliance and its nodes on the network and requires a single click to confirm/add to an existing cluster
Patches and software updates with no workload downtime via systematic vMotions

Component 3: EMC Software
VSPEX Blue Manager
EMC 24×7 support
EMC RecoverPoint
VDPA and Data Domain
EMC CloudArray

CPUID Flags with Windows 2012 Virtual Machines

Heeki Park — Fri, 10 Jan 2014 16:50:24 +0000

I was recently working with a customer who was trying to deploy Windows 2012 R2 virtual machines on a Cisco UCS cluster running ESXi 5.1.1. However, the customer was getting the following error:

“Your PC needs to restart. Please hold down the power button. Error Code: 0x000000C4”

I went through the process myself and didn’t encounter any issues with the default settings. After running through the process with the customer, I found that he was changing the CPUID Mask from “Expose the NX/XD flag to guest” to “Hide the NX/XD flag from guest” in order to increase vMotion compatibility to other non-UCS ESXi hosts.

After resetting the flag to “Expose the NX/XD flag to guest”, the error went away, and the Windows 2012 R2 guest OS was able to boot without issue.

Apache Hadoop Deployment

Heeki Park — Mon, 06 Jan 2014 21:13:17 +0000

It’s been a while since posting here. I figured I’d start the new year with a new post on what I spent much of the latter part of 2013 researching in my spare time. As a result of my reading on big data and data analytics, I built an Apache Hadoop 2.2.0 cluster in my lab on virtual machines. I chose to go with the vanilla Apache distribution rather than Hortonworks Sandbox or VMware Serengeti as I took the manual process of installation as an opportunity to learn the components and internals of the environment. Below is a compilation of tutorials and my own tinkering as I built the environment from scratch. A lot of this tutorial was gleaned from Michael Noll’s tutorial for deploying Hadoop in Ubuntu with my own adaptation for CentOS.

Installation Pre-requisites:

I built the whole environment on CentOS 6 (i386) as a virtual machine. I chose to create a virtual machine template to keep the process of adding new nodes to the cluster simple. I chose the i386 version for two reasons: 1) I was only using 2GB RAM on my virtual machines and 2) the pre-compiled Apache Hadoop 2.2.0 distribution ships with 32-bit native libraries. Yes, I could have compiled from source for 64-bit, but I was just trying to keep it simple.

Next are some of the preparation steps I took in the CentOS virtual machine build.

Pre-configure DNS resolution for all of the nodes that would be added to the cluster
Disable IPv6

/etc/sysctl.conf
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1

Edit /etc/hosts and remove the ::1 entry
Reboot the server and check if IPv6 is disabled

cat /proc/sys/net/ipv6/conf/all/disable_ipv6
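
The kernel flag reads 1 when IPv6 is disabled and 0 when it is still enabled. The check can be wrapped in a small function, sketched below. The name `ipv6_status` is made up, and the optional file argument exists only so the function can be pointed at a test file.

```shell
#!/bin/bash
# Report whether IPv6 is disabled, based on the kernel flag checked above.
# Prints "disabled", "enabled", or "unknown" (flag file not readable).
ipv6_status() {
  local flag_file=${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}
  if [ ! -r "$flag_file" ]; then
    echo "unknown"
    return 2
  fi
  if [ "$(cat "$flag_file")" = "1" ]; then
    echo "disabled"
  else
    echo "enabled"
    return 1
  fi
}

# The non-zero return codes are informational here, so ignore them.
ipv6_status || true
```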

Install and configure Java. Download the latest Java installation and copy it to /var/tmp

cd /var/tmp
tar xvfz jdk-7u45-linux-i586.tar.gz -C /opt

# edit /root/.bashrc
export JAVA_HOME=/opt/jdk1.7.0_45
export JRE_HOME=/opt/jdk1.7.0_45/jre
export PATH=$PATH:/opt/jdk1.7.0_45/bin:/opt/jdk1.7.0_45/jre/bin

# set the path link:
# alternatives --install [link] [name] [path] [priority]
[root@hadoop01 bin]# alternatives --install /usr/bin/java java /opt/jdk1.7.0_45/bin/java 2
[root@hadoop01 bin]# alternatives --config java
There is 1 program that provides 'java'.
  Selection    Command
-----------------------------------------------
 *+ 1           /opt/jdk1.7.0_45/bin/java
 Enter to keep the current selection[+], or type selection number: 1

# check the Java version: java -version
[root@hadoop01 ~]# java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) Client VM (build 24.45-b08, mixed mode)

Hadoop Configuration and Setup:

Now it’s time to actually install and configure the Hadoop components:

Create a Hadoop user account and group as root

groupadd hadoop
useradd -g hadoop hadoop
id hadoop

Add the hadoop user to /etc/sudoers

hadoop    ALL=(ALL)       ALL

Configure key-based login via SSH as the hadoop user

su - hadoop
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

[hadoop@hadoop01 ~]$ ssh-keygen -t rsa 
Generating public/private rsa key pair. 
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
Created directory '/home/hadoop/.ssh'. 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/hadoop/.ssh/id_rsa. 
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub. 
The key fingerprint is: c6:97:3a:39:a7:66:8f:9a:9b:bc:f1:0a:8c:29:b4:63 hadoop@hadoop01.nycstorm.lab 
The key's randomart image is: 
+--[ RSA 2048]----+ 
|                 | 
|                 | 
|                 | 
|       .   .     | 
| .      S o      | 
|. .+   . +       | 
|.Eo o . = .      | 
|...  o =o*       | 
|      OB+..      | 
+-----------------+ 
[hadoop@hadoop01 ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
[hadoop@hadoop01 ~]$ chmod 0600 ~/.ssh/authorized_keys

Update the ~/.bashrc script with all the necessary environment variables for the hadoop user:

export JAVA_HOME=/opt/jdk1.7.0_45
export JRE_HOME=/opt/jdk1.7.0_45/jre
export PATH=$PATH:/opt/jdk1.7.0_45/bin:/opt/jdk1.7.0_45/jre/bin
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

If you leave IPv6 enabled instead, export the following setting so that Hadoop prefers the IPv4 stack. It isn’t necessary here because we’ve already disabled IPv6.

export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

Download the Hadoop files

yum install wget
cd /var/tmp
wget http://apache.mesi.com.ar/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
tar xvfz hadoop-2.2.0.tar.gz -C /opt
cd /opt
chown -R hadoop:hadoop hadoop-2.2.0
ln -s hadoop-2.2.0 hadoop

Update the Hadoop configuration files in the /opt/hadoop/etc/hadoop directory

# core-site.xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://hadoop01:9000</value>
</property>

# hdfs-site.xml
# dfs.replication is the number of replicas of each block
# dfs.name.dir is the path on the local fs where namenode stores the namespace and transactions persistently
# dfs.data.dir is the comma-separated list of paths on the local fs of the datanode where it stores its blocks
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>file:///opt/hadoop/data/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>file:///opt/hadoop/data/dfs/data</value>
</property>

# run the following command to copy from the template to the actual file that we will edit
cp mapred-site.xml.template mapred-site.xml

# mapred-site.xml
# mapred.job.tracker is the host and port of the jobtracker
# mapred.system.dir is where mapreduce stores system/control files
# mapred.local.dir is where mapreduce stores temp/intermediate files
<property>
  <name>mapred.job.tracker</name>
  <value>hadoop01:9001</value>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/opt/hadoop/data/mapred/system/</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/opt/hadoop/data/mapred/local/</value>
</property>

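# yarn-site.xml
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>hadoop01</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>hadoop01:8032</value>
</property>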

# slaves
# delete localhost and add the hostname of each worker node, one per line; for now just:
hadoop01

Creating the clone:

At this point, the node is ready to become a virtual machine template for cloning. First, though, the OS needs to be prepared for cloning. Run the following commands, which were taken from the Lone Sysadmin site, and then shut down the virtual machine.

/usr/bin/yum clean all
/bin/cat /dev/null > /var/log/audit/audit.log
/bin/cat /dev/null > /var/log/wtmp
/bin/rm -f /etc/udev/rules.d/70*
/bin/sed -i '/^\(HWADDR\|UUID\)=/d' /etc/sysconfig/network-scripts/ifcfg-eth0
/bin/rm -f /etc/ssh/*key*
/bin/rm -f /home/hadoop/.ssh/*
/bin/rm -Rf /tmp/*
/bin/rm -Rf /var/tmp/*
/bin/rm -f ~root/.bash_history
unset HISTFILE

However, each time a new clone/node is added to the cluster, a few settings need to be updated. I’ve created a script to initialize a newly cloned node. This script should be created in the hadoop user’s home directory. It can be run as the hadoop user using the following syntax: ~/initialize.sh hostname ipaddr

#!/bin/bash
# initialize.sh on new node
HOSTNAME=$1
IPADDR=$2
# both arguments are required
if [ -z "$HOSTNAME" ] || [ -z "$IPADDR" ]
then
  echo "usage: initialize.sh hostname ipaddr"
  exit 1
fi
# replace the template's hostname and IP address with the new node's values
sudo /bin/sed -i "s/HOSTNAME=hadoop.nycstorm.lab/HOSTNAME=$HOSTNAME/" /etc/sysconfig/network
sudo /bin/sed -i "s/IPADDR=192.168.11.49/IPADDR=$IPADDR/" /etc/sysconfig/network-scripts/ifcfg-eth0
grep HOSTNAME /etc/sysconfig/network
grep IPADDR /etc/sysconfig/network-scripts/ifcfg-eth0
echo
# the banner lines are quoted because an unquoted # starts a shell comment
echo "######################################################################"
echo "# NOTICE: $HOSTNAME needs reboot now for hostname -f to take effect. #"
echo "# #"
echo "# SERVICE: hadoop-daemon.sh start datanode #"
echo "# SERVICE: yarn-daemon.sh start nodemanager #"
echo "# #"
echo "######################################################################"
echo

Also, some updates need to be made to the master node each time a node is added. I’ve created the following script to handle that:

#!/bin/bash
ADDNODE=$1
if [ -z "$ADDNODE" ]
then
  echo usage: addnode.sh hostname
  exit 1
fi
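# register the new node as a worker
echo "$ADDNODE" >> /opt/hadoop/etc/hadoop/slaves
# copy the master's public key over and authorize it for key-based login
scp ~/.ssh/id_rsa.pub "hadoop@$ADDNODE:/home/hadoop/id_rsa.pub.hadoop01"
ssh "hadoop@$ADDNODE" "cat id_rsa.pub.hadoop01 >> .ssh/authorized_keys; chmod 600 .ssh/authorized_keys"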

Starting Services:

The following are a few commands for starting overall cluster services and for checking status:

# format the hdfs filesystem
/opt/hadoop/bin/hdfs namenode -format
# start the hdfs services
/opt/hadoop/sbin/start-dfs.sh
# start the YARN services (ResourceManager and NodeManagers)
/opt/hadoop/sbin/start-yarn.sh
# check the status of all services
jps
# if all starts properly, you will see the following:
 2583 DataNode
 2970 ResourceManager
 3461 Jps
 3177 NodeManager
 2361 NameNode
 2840 SecondaryNameNode

Note that on each individual data node, you can start the datanode and nodemanager services via the following commands:

hadoop-daemon.sh start datanode
yarn-daemon.sh start nodemanager

From the master node, you can check the status of all nodes:

[hadoop@hadoop01 ~]$ yarn node -list
13/11/18 13:07:44 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.11.50:8032
Total Nodes:10
Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
hadoop04.nycstorm.lab:52906     RUNNING hadoop04.nycstorm.lab:8042     0
hadoop03.nycstorm.lab:44443     RUNNING hadoop03.nycstorm.lab:8042     0
hadoop08.nycstorm.lab:42321     RUNNING hadoop08.nycstorm.lab:8042     0
hadoop10.nycstorm.lab:53675     RUNNING hadoop10.nycstorm.lab:8042     0
hadoop07.nycstorm.lab:33923     RUNNING hadoop07.nycstorm.lab:8042     0
hadoop01.nycstorm.lab:48101     RUNNING hadoop01.nycstorm.lab:8042     0
hadoop02.nycstorm.lab:60853     RUNNING hadoop02.nycstorm.lab:8042     0
hadoop05.nycstorm.lab:39854     RUNNING hadoop05.nycstorm.lab:8042     0
hadoop09.nycstorm.lab:45020     RUNNING hadoop09.nycstorm.lab:8042     0
hadoop06.nycstorm.lab:35679     RUNNING hadoop06.nycstorm.lab:8042     0
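
With more than a handful of nodes, it can be easier to count the RUNNING entries than to read the listing. The sketch below assumes the column layout shown above, where the node state is the second whitespace-separated field. The function name `count_running_nodes` is made up for illustration.

```shell
#!/bin/bash
# Count nodes whose second column is RUNNING in `yarn node -list`
# output read from stdin. Header and log lines are ignored because
# their second field is not RUNNING.
count_running_nodes() {
  awk '$2 == "RUNNING" { n++ } END { print n + 0 }'
}

# Example with lines copied from the listing above. On a live cluster,
# pipe in the real command instead: yarn node -list | count_running_nodes
count_running_nodes <<'EOF'
Total Nodes:10
Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
hadoop04.nycstorm.lab:52906     RUNNING hadoop04.nycstorm.lab:8042     0
hadoop03.nycstorm.lab:44443     RUNNING hadoop03.nycstorm.lab:8042     0
EOF
```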

You can then upload some files and take a look at the status of your HDFS filesystem.

[hadoop@hadoop01 data]$ hdfs dfs -ls /data
Found 1 items
-rw-r--r-- 2 hadoop supergroup 284806 2013-11-26 13:02 /data/pg16_peter_pan.txt
[hadoop@hadoop01 data]$ hdfs dfs -copyFromLocal * /data
copyFromLocal: `/data/pg16_peter_pan.txt': File exists
[hadoop@hadoop01 data]$ hdfs dfs -ls /data
Found 14 items
-rw-r--r-- 2 hadoop supergroup 167517 2013-11-26 15:25 /data/pg11_alice_in_wonderland.txt
-rw-r--r-- 2 hadoop supergroup 3322651 2013-11-26 15:25 /data/pg135_les_miserables.txt
-rw-r--r-- 2 hadoop supergroup 284806 2013-11-26 13:02 /data/pg16_peter_pan.txt
-rw-r--r-- 2 hadoop supergroup 1257274 2013-11-26 15:25 /data/pg2701_moby_dick.txt
-rw-r--r-- 2 hadoop supergroup 90701 2013-11-26 15:25 /data/pg41_sleepy_hollow.txt
-rw-r--r-- 2 hadoop supergroup 1573150 2013-11-26 15:25 /data/pg4300_ulysses.txt
-rw-r--r-- 2 hadoop supergroup 181997 2013-11-26 15:25 /data/pg46_a_christmas_carol.txt
-rw-r--r-- 2 hadoop supergroup 1423803 2013-11-26 15:25 /data/pg5000_notes_of_leonardo_davinci.txt
-rw-r--r-- 2 hadoop supergroup 141419 2013-11-26 15:25 /data/pg5200_metamorphosis.txt
-rw-r--r-- 2 hadoop supergroup 421884 2013-11-26 15:25 /data/pg74_adventures_of_tom_sawyer.txt
-rw-r--r-- 2 hadoop supergroup 610157 2013-11-26 15:25 /data/pg76_adventures_of_huckleberry_finn.txt
-rw-r--r-- 2 hadoop supergroup 142382 2013-11-26 15:25 /data/pg844_the_importance_of_being_earnest.txt
-rw-r--r-- 2 hadoop supergroup 448689 2013-11-26 15:25 /data/pg84_frankenstein.txt
-rw-r--r-- 2 hadoop supergroup 641414 2013-11-26 15:25 /data/pg8800_the_divine_comedy.txt
[hadoop@hadoop01 data]$
[hadoop@hadoop01 data]$ hdfs dfsadmin -report
Configured Capacity: 211378749440 (196.86 GB)
Present Capacity: 195689984139 (182.25 GB)
DFS Remaining: 195668267008 (182.23 GB)
DFS Used: 21717131 (20.71 MB)
DFS Used%: 0.01%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

Running on Hadoop

I then wrote a mapper.pl and reducer.pl script to do the simple word count example and ran that against the files that I uploaded to HDFS. With those Perl files, I then used the streaming API to run a Hadoop 2.0 job.

hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
-mapper /home/hadoop/src/mapper.pl \
-reducer /home/hadoop/src/reducer.pl \
-input /data/pg16_peter_pan.txt -output /output/pg16_peter_pan

Below is the Perl code that I used for the mapper.pl and reducer.pl scripts. It can be adapted to Python or any other language. You can also test the code locally by running: cat <input file> | ./mapper.pl | ./reducer.pl

[hadoop@hadoop01 src]$ cat mapper.pl
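#!/usr/bin/perl
mapper();
sub mapper {
  # read each input line from STDIN, as the streaming API expects
  foreach my $line (<STDIN>) {
    chomp($line);
    # strip punctuation and treat double dashes as word separators
    $line =~ s/[.,;:?!"()\[\]]//g;
    $line =~ s/--/ /g;
    my @words = split(/\s+/, $line);
    foreach my $word (@words) {
      # skip the empty field produced by leading whitespace
      next if $word eq '';
      print "$word\t1\n";
    }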
  }
}
[hadoop@hadoop01 src]$ cat reducer.pl
#!/usr/bin/perl
reducer();
sub reducer {
  my %hash;
  # read the mapper's "word<TAB>count" lines from STDIN and sum per word
  foreach my $line (<STDIN>) {
    chomp($line);
    my ($key,$value) = split(/\t/, $line);
    if (defined $hash{$key}) {
      $hash{$key} += $value;
    } else {
      $hash{$key} = $value;
    }
  }
  foreach my $key (keys %hash) {
    print "$key\t$hash{$key}\n";
  }
}
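
To sanity-check the results of the Perl scripts, the same map, shuffle, and reduce stages can be imitated locally with standard Unix tools. This is a rough sketch rather than a replacement for the streaming job; the function name `local_wordcount` is made up, and `LC_ALL=C` is set only to make the sort order predictable.

```shell
#!/bin/bash
# Local word count that mirrors the three stages of the streaming job:
#   map:     strip the same punctuation as mapper.pl, one word per line
#   shuffle: sort so that identical words are adjacent
#   reduce:  count each run of identical words
# Output is "word<TAB>count", the same shape that reducer.pl prints.
local_wordcount() {
  sed -e 's/[].,;:?!"()[]//g' -e 's/--/ /g' \
    | tr -s '[:space:]' '\n' \
    | sed '/^$/d' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ print $2 "\t" $1 }'
}

printf 'All children, except one, grow up.\nAll of them.\n' | local_wordcount
```

Running the same input file through both this pipeline and the Perl scripts, then comparing the sorted outputs, is a quick way to catch mistakes before submitting a job to the cluster.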

Status

There are a few web consoles to look at the status of your Hadoop grid:

http://localhost:50070/ – web UI of the NameNode daemon
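http://localhost:8088/ – web UI of the YARN ResourceManager daemon (the JobTracker UI on port 50030 applies only to Hadoop 1.x)
http://localhost:8042/ – web UI of the YARN NodeManager daemon (the TaskTracker UI on port 50060 applies only to Hadoop 1.x)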

Troubleshooting

In the process of building the cluster, I ran into a number of issues. Below are some of the issues and the resolution to those issues.

The first issue I ran into was connectivity between the master node and the data nodes.

2013-11-15 17:24:18,463 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: hadoop01/192.168.11.50:9000
2013-11-15 17:24:24,465 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop01/192.168.11.50:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

The cause was the iptables firewall on the master node, which was blocking access from all data nodes. I didn’t spend the time to add the proper firewall rules. Instead, I chose to either disable iptables or delete all the firewall rules. Not secure, but again, this exercise was for the purpose of learning Hadoop, not deploying a production cluster.

# list the current rules
iptables --list
# delete all rules
iptables --flush
# stop the iptables service
/etc/init.d/iptables stop
# disable iptables from starting on boot
chkconfig iptables off
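
For anyone who would rather not drop the firewall entirely, a narrower alternative is to admit cluster traffic from the lab subnet only. The sketch below only prints candidate rules so they can be reviewed before being run as root. It rests on assumptions: the subnet is inferred from the 192.168.11.x addresses in this post, and the port list contains only ports that appear in this post's configs and logs, so it is not a complete list for Hadoop 2.2.0. The function name `print_hadoop_rules` is made up.

```shell
#!/bin/bash
# Print (not run) iptables rules that would accept TCP traffic from one
# subnet on the Hadoop ports seen in this post:
#   9000  NameNode RPC        50010 DataNode data transfer
#   50070 NameNode web UI     8032  ResourceManager
#   8042  NodeManager web UI
print_hadoop_rules() {
  local subnet=${1:-192.168.11.0/24}
  local port
  for port in 9000 50010 50070 8032 8042; do
    echo "iptables -I INPUT -p tcp -s $subnet --dport $port -j ACCEPT"
  done
}

print_hadoop_rules
```

Other daemons listen on additional ports, so expect to extend the list after checking the logs for further connection errors.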

I also had issues writing to HDFS due to SELinux. Unfortunately, I didn’t capture the log entry for that error, but it was a write failure to HDFS. I ran the following to rectify that issue.

sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
# switch SELinux to permissive mode right away; the config change above takes effect at the next reboot
setenforce 0

As part of troubleshooting the HDFS write issues above, I ended up reformatting HDFS, which caused further issues. If you ever reformat HDFS, you also need to delete the contents of the dfs.data.dir directory on each data node. Otherwise, you will see the following incompatible clusterIDs error messages in the datanode logs:

0831241-192.168.11.50-50010-1385486041683) service to hadoop01/192.168.11.50:9000
java.io.IOException: Incompatible clusterIDs in /opt/hadoop-2.2.0/data/dfs/data: namenode clusterID = CID-8b249a29-681f-4417-a464-a849d3a9cc9c; datanode clusterID = CID-472489cb-19b3-4381-8572-c9bf7bf5db64
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:744)
2013-11-26 12:19:36,471 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1806045400-192.168.11.50-1385486221792 (storage id DS-1520831241-192.168.11.50-50010-1385486041683) service to hadoop01/192.168.11.50:9000
2013-11-26 12:19:36,584 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1806045400-192.168.11.50-1385486221792 (storage id DS-1520831241-192.168.11.50-50010-1385486041683)
2013-11-26 12:19:38,584 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2013-11-26 12:19:38,586 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2013-11-26 12:19:38,596 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at hadoop01.nycstorm.lab/192.168.11.50
************************************************************/
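
The mismatch can be confirmed before restarting anything by comparing the clusterID lines in the VERSION files that the NameNode and DataNode keep under their storage directories. A minimal sketch follows; the default paths assume the dfs.name.dir and dfs.data.dir values used in this post, and the function names are made up.

```shell
#!/bin/bash
# Extract the clusterID value from a Hadoop storage VERSION file.
cluster_id() {
  sed -n 's/^clusterID=//p' "$1"
}

# Compare the NameNode and DataNode VERSION files.
# Returns 0 on a match, 1 on a mismatch, 2 if a file cannot be read.
check_cluster_ids() {
  local nn=${1:-/opt/hadoop/data/dfs/name/current/VERSION}
  local dn=${2:-/opt/hadoop/data/dfs/data/current/VERSION}
  if [ ! -r "$nn" ] || [ ! -r "$dn" ]; then
    echo "VERSION file not found" >&2
    return 2
  fi
  if [ "$(cluster_id "$nn")" = "$(cluster_id "$dn")" ]; then
    echo "clusterIDs match"
  else
    echo "clusterIDs differ: clear the DataNode directory before restarting the DataNode"
    return 1
  fi
}

# Run against the default paths; the exit status is informational here.
check_cluster_ids || true
```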

That’s all. At this point you should have a running Hadoop cluster and a job run against the cluster.

PowerPath/Migration Enabler

Heeki Park — Fri, 22 Jun 2012 17:19:39 +0000

Introduction:
Performing data migrations is something every systems and storage administrator has had to deal with at some juncture in his/her career. Many migration techniques require downtime for the application. Block-level techniques like SAN Copy (CLARiiON/VNX) or Open Replicator (Symmetrix) require bringing the application down in order to perform the cutover. There are other techniques that do not require downtime. Some examples of this are host-based volume manager or filter driver techniques – Veritas Volume Manager (VxVM), PowerPath/Migration Enabler (PP/ME), or something like Federated Live Migration between two Symmetrix arrays (uses both PowerPath and Open Replicator).

In this blog, I’ll cover some basics and gotchas for PP/ME.

PowerPath Migration Enabler takes advantage of the multi-pathing capabilities within PowerPath. PowerPath functions as a filter driver within the I/O stack on a server. When an application writes to a volume, the I/O takes the following path down to storage:

File system (Windows drive letter or Unix/Linux mounted file system) or Database raw partition
Logical volume manager
PowerPath (pseudo devices)
SCSI driver (native devices)
SCSI controller or HBA
Storage controller via SAN or direct connection

For a Symmetrix volume, you will typically have two paths for a single device. The server will then have a native device for each of those two paths. From a server perspective, it looks like two different physical devices, but we know that they represent the same device. PowerPath then creates a pseudo device that encapsulates those native devices as a single device. Note: on a Unix/Linux server you could technically still address a native device, but PowerPath will still intercept the I/O and intelligently load balance as appropriate.

Why the explanation on PowerPath native and pseudo devices? It’s because PP/ME takes advantage of its position in the I/O stack to copy data to new devices in the background, by copying/mirroring data from one native device to a new native device (similar to VxVM plex mirroring). That activity is transparent to the file system above it. This is why you can perform a migration/cutover to a new storage array while keeping the application online.

The rest of the blog will focus on specifics within a Windows environment, but the same concepts can be applied to Unix/Linux servers.

Pre-requisites:
Below are some of the requirements prior to migrating with PP/ME.

PowerPath 5.3.1 at minimum must be installed.
- If you already have this version or later but did a typical install, you likely won’t have PP/ME installed.
- To get PP/ME installed, just run the installer again, choose the custom installation option and then select the PP/ME option.
- In the case where you already had PowerPath installed and are just getting the PP/ME feature added, a reboot is not required but it is recommended. I’ve seen some weird issues with the service without that reboot.
You will also require the HostCopy license for PP/ME. It can be obtained for free from EMC, assuming you already have the base PowerPath license and a current maintenance contract.
Once all of these are complete, you can verify that you have it by running the powermig command in Command Prompt.

Setting up the Source/Target Mapping:
Next you need to know what the source and destination LUNs should be. The main way would be via the PowerPath Console. In the illustration below, I’ve moved the columns to make it easier to see. You’ll also note that the 3rd device is failed. That was an old CX4 LUN that was reclaimed but not yet cleaned up, which enabled me to at least get a view of both Symmetrix and CLARiiON LUNs in PowerPath.

Disk Number – will correlate to the physical drive in Windows disk management
Device – for Symmetrix devices, it will give you the symdev and for CLARiiON/VNX devices, it will give you the UID.
LUN Name – for Symmetrix devices, it will give you nothing and for CLARiiON/VNX devices, it will give you the name of the LUN. If you have the ALU in the name, then it becomes easy to map.

Host Resources and Throttling:
During the synchronization process, host resources will be used to perform the copy. Depending on what is going on with the server, you may want to throttle the host resources that PP/ME is allowed to use. Below are some of the settings that you can use. The percentage represents the percentage of time the host spends copying data.

0: 100%
1: 60%
2: 36% (default)
3: 22%
4: 13%
5: 7.8%
6: 4.7%
7: 2.8%
8: 1.7%
9: 1.0%
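
The percentages in the table appear to follow a geometric progression of roughly 100 x 0.6^n for throttle value n. That is a pattern read off the numbers above, not a formula taken from EMC documentation, so treat the sketch below as a memory aid only. The function name `throttle_percent` is made up.

```shell
#!/bin/bash
# Approximate the copy-time percentage for a PP/ME throttle value (0-9)
# as 100 * 0.6^n, a pattern fitted to the table above.
throttle_percent() {
  awk -v n="$1" 'BEGIN { printf "%.1f\n", 100 * (0.6 ^ n) }'
}

for n in 0 1 2 3 4 5 6 7 8 9; do
  printf '%s: %s%%\n' "$n" "$(throttle_percent "$n")"
done
```

The computed values agree with the table once rounded to two significant figures (for example, 21.6 for throttle value 3 against the listed 22%).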

Migration Process:
And now the process. As stated earlier, the migration process does not require bringing the application down. While the migration should still be done in a maintenance window with appropriate notice to the user/business community, the application will take no downtime.

Present the new target storage
Set up the PP/ME sessions using the mapping technique above
Perform initial synchronization prior to the migrations with the appropriate throttle settings
During a maintenance window:
- Switch targets to the new storage and commit the changes
- Remove the old storage
- Perform UAT

Below are commands required for setting up, executing and cleaning up PP/ME.

# create a session; returns a handle number
powermig setup -src harddiskXX -tgt harddiskYY -techType hostcopy

# start the background copy, will enter sourceSelected state when complete
powermig sync -handle 1

# monitor status of the session
powermig query -handle 1

# set the throttle per the settings stated above
powermig throttle -throttleValue 0 -handle 1

# switch over to the new storage, still mirrors back to the original
# will enter targetSelected state
powermig selectTarget -handle 1

# stop mirroring to the original, will enter committed state
powermig commit -handle 1

# cleanup/delete the session
powermig cleanup -handle 1

Gotchas:
Below are some gotchas that I’ve seen in my experience with PP/ME:

For Windows servers, you need to make sure that the syntax of the source and target devices is “harddiskXX”, where XX maps to the PHYSICALDRIVE number found in PowerPath or Disk Management. The symdev, LUN name, LUN ID, or LUN UID will not work.
The synchronization time is largely dependent on host resources and the underlying storage. You could set the throttle to 0 (allowing 100% host resources), but if there is a lot going on on the server, PP/ME will be competing with other activity to perform the background copies. Hence, you need to find appropriate times to perform the initial synchronization.
PP/ME is supported with MSCS clusters. However, it is critical that no resource failovers happen at any point during a PP/ME session. Why? Because once the new LUN is mirrored, it not only has the same data on it, it also has the same disk signature. In the event of a cluster failover, the secondary node will not know about the PP/ME source/target relationship. Therefore, the cluster node will think that the server has two different LUNs with the same disk signature. Because the target LUN is write-disabled (locked by PP/ME on the other node), Windows will then re-signature the source LUN. With a new signature on the source LUN, the cluster will no longer recognize that LUN in the cluster, and therefore the cluster will fail to come back online. You will need to use tools like diskmap and dumpcfg to manually resignature the source LUN back to the original disk signature.

RecoverPoint Initial Synchronization with DD

Heeki Park — Wed, 02 May 2012 21:21:10 +0000

The Downlow
While some companies may be equipped with an abundance of bandwidth between their production and disaster recovery sites, many others are limited with their site-to-site bandwidth. As such, many implement data replication technologies that also perform data compression, de-duplication, and even fast-write capabilities in IP and fibre channel protocols.

In this particular case study, I’m working with a customer with two data centers, New York and Washington DC, with a 50 Mb/s line between them. EMC RecoverPoint is the replication technology of choice, and the customer is doing bi-directional replication. The Washington DC site has about 4TB of data that needs to replicate to New York, and the New York site has about 10TB of data that needs to replicate to Washington DC.

In a perfect world (no latency, no packet loss, 100% utilization of the link), it would take roughly 7 days to replicate the 4TB and roughly 18 days to replicate the 10TB. That’s almost a month to move the data with the link fully saturated. Unfortunately, that link was also used for other business uses, e.g. VoIP traffic, internal application traffic, server monitoring traffic, etc. Thus the CIO mandated that we find another method to perform the initial synchronization, as using the link (even throttled) was not an option for this duration.
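
The estimates above can be reproduced with a few lines of shell. This sketch assumes decimal units (1TB = 10^12 bytes, 1 Mb/s = 10^6 bits per second) and a link with no overhead; the function name `transfer_days` is made up for illustration.

```shell
#!/bin/bash
# Ideal transfer time in days for a data set over a fully used link.
# Arguments: size in TB (10^12 bytes) and link speed in Mb/s.
transfer_days() {
  awk -v tb="$1" -v mbps="$2" 'BEGIN {
    bits = tb * 1e12 * 8            # data set size in bits
    seconds = bits / (mbps * 1e6)   # divide by link speed in bits per second
    printf "%.1f\n", seconds / 86400
  }'
}

echo "DC to NY, 4TB:  $(transfer_days 4 50) days"
echo "NY to DC, 10TB: $(transfer_days 10 50) days"
```

Protocol overhead, latency, and competing traffic only push these numbers up, which is why seeding from a backup was attractive.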

The Approach
The EMC RecoverPoint Release 3.4 Administrator Guide (P/N 300-012-256) documents a method for performing first-time initialization from backup. The primary kicker here though, is that the backup must be a block-level backup, not a file-level backup. This is because the target RecoverPoint image will be seeded with that block-level backup and then RecoverPoint will perform a full volume sweep to synchronize the incremental changes since the block-level backup.

Most companies, however, do not perform block-level backups of their servers. Rather, they perform file-level backups, which are then catalogued for easy restores. Below is a summary of the process I used to perform the RecoverPoint initial synchronization using dd as the block-level backup.

Pre-requisites/Setup

Downloaded dd on a Windows utility server
- http://www.chrysocome.net/downloads/dd-0.6beta3.zip
- This is the tool we will use for the block-level backups.
```
dd if=[vol_source] of=[vol_target] bs=512k
```
- Note: I did some very rudimentary performance tests to see what block size would be optimal for these backups. I found 512k to be the sweet spot.
Configured clones for all volumes that will be seeded with RecoverPoint. The main reason for this is twofold:
- I didn’t want to impact the performance of the production volume while dd reads from the source volume to create the backup.
- dd cannot operate against volumes with open files. Thus, we’d need to bring down the applications for the duration of the dd backup. When performing a dd against a mounted clone and against the PHYSICALDRIVE address, I did not get open file errors. Below is an example of the errors you will see with dd if there are open files.
```
C:\Utilities>dd if=\\.\H: of=z:\testvolume.img bs=512k
rawwrite dd for windows version 0.6beta3.
Written by John Newbigin 
This program is covered by terms of the GPL Version 2.
Error opening input file: 32 The process cannot access the file because it is being used by another process
```
- The source volumes were on 15k drives and the target volumes were on 7.2k SATA drives. I was able to copy roughly 5 GB/min (+/-0.5) with this process.

The Process

For new source volumes, confirm that all the data has first been migrated before proceeding.
Configure the consistency group(s) for the volumes in scope
- When finishing the consistency group, do not start the transfer. Leave the transfer paused.
Right-click the consistency group, select “Clear Markers”
- This will let RP know that the remote site is known to be identical to its corresponding production volume. Thus a full volume sweep is not required.
- When the dialog box pops up, select both copies.
- Note: had to do this via command line because the GUI was only letting me clear the markers in the DR location. The command line without the copy=XYZ option allows you to clear all markers on both sides.
```
clear_markers group=RPSyncTest
```
Create the block-level copy with dd
Transfer the copy to the secondary site. In our case, we shipped the USB drives to the secondary site.
Enable image access on the secondary volume
- Select the latest image
- After access goes to logged access, enable direct access
Restore the backup to the secondary volume
- Remember, you already did the clear markers before you did the first dd copy. If you do it again, it will mess up tracking where the replication should resume.
- No need to give the drive a drive letter or format it, as you can access it via the \\.\PHYSICALDRIVE2 address.
Disable image access and start the transfer
- Check the “Start data transfer immediately” checkbox to resume replication
- Monitor the consistency group. The traffic you see will be the changes to the source volume since the block-level dd copy was made. The sync should take significantly less time than a full copy, depending on how much data has changed since the original dd backup.

The Results
Below are some of the results from the initial synchronization process. Note that between the dd on the source and the reverse dd on the secondary volume, roughly two days elapsed.

330GB consistency group 1
- At 50 Mb/s, it would have taken roughly 15 hours to perform a full sync.
- Initial synchronization took 58 minutes, transferring roughly 21GB (6.36%).
- We saved roughly 14 hours and 309GB of transfer.
330GB consistency group 2
- Initial synchronization took 43 minutes, transferring roughly 10GB (3.03%).
- We saved a little over 14 hours and 320GB of transfer.
330GB consistency group 3
- Initial synchronization took 50 minutes, transferring roughly 18GB (5.45%).
- We saved a little over 14 hours and 312GB of transfer.
330GB consistency group 4
- Initial synchronization took 53 minutes, transferring roughly 19GB (5.76%).
- We saved a little over 14 hours and 311GB of transfer.
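
The percentages and savings above can be checked with a short script. This is a sketch under the same assumptions as the estimate (decimal units, a fully available 50 Mb/s link); the function and variable names are mine, and the input figures are simply the ones reported in the list.

```python
LINK_MBPS = 50      # site-to-site link from the case study
GROUP_GB = 330      # size of each consistency group

# (group number, minutes for the seeded sync, GB actually transferred)
RESULTS = [(1, 58, 21), (2, 43, 10), (3, 50, 18), (4, 53, 19)]


def full_sync_hours(size_gb, link_mbps=LINK_MBPS):
    """Hours for a full sync at line rate, assuming decimal units."""
    return size_gb * 1e9 * 8 / (link_mbps * 1e6) / 3600


def summarize(minutes, transferred_gb, size_gb=GROUP_GB):
    """Return (percent of volume transferred, hours saved, GB saved)."""
    pct = 100 * transferred_gb / size_gb
    hours_saved = full_sync_hours(size_gb) - minutes / 60
    return pct, hours_saved, size_gb - transferred_gb


for group, minutes, gb in RESULTS:
    pct, hours, saved_gb = summarize(minutes, gb)
    print(f"group {group}: {pct:.2f}% transferred, "
          f"~{hours:.1f} h and {saved_gb} GB saved")
```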

I would post the initialization graphs of the above consistency groups, but the window for the graphs is 5 minutes and would just show constant transfer. Instead, below are graphs of a 1GB test volume that I configured.

This is a graph of the initialization traffic without the data pre-seeded. Note that the green line for site-to-site traffic hovers between 35-50 Mb/s for almost 2 minutes.

This is a graph of the initialization traffic with the data pre-seeded. Note that the green line spikes for a short duration of time to do the full volume sweeps but lasts for only 30 seconds.

The Commands

```
plink -l admin -pw admin 192.168.10.10 "enable_group group=RPSyncGroup start_transfer=no"
plink -l admin -pw admin 192.168.10.10 "clear_markers group=RPSyncGroup"
dd if=\\.\[PHYSICALDRIVE##] of=z:\[PHYSICALDRIVE##.img] bs=512k
[transfer the images to the secondary site via USB drive]
plink -l admin -pw admin 192.168.10.10 "enable_image_access group=RPSyncGroup copy=DR_RPSyncGroup image=latest"
plink -l admin -pw admin 192.168.10.10 "set_image_access_mode group=RPSyncGroup copy=DR_RPSyncGroup mode=direct"
dd if=z:\[PHYSICALDRIVE##.img] of=\\.\[PHYSICALDRIVE##] bs=512k
plink -l admin -pw admin 192.168.10.10 "disable_image_access group=RPSyncGroup copy=DR_RPSyncGroup start_transfer=no"
plink -l admin -pw admin 192.168.10.10 "start_transfer group=RPSyncGroup"
[monitor initial synchronization traffic]
```

Apache on CentOS 6.2 with Sub-directories

Heeki Park — Thu, 19 Apr 2012 16:09:50 +0000

I built a CentOS 6.2 virtual machine on my VMware Workstation as a utility server (192.168.1.135). I used the CentOS-6.2-i386-minimal.iso to do the install and then installed a LAMP stack on it. After that, the next step was to get phpMyAdmin to manage the MySQL database. I did the following to do so:

1. Downloaded the latest package from http://www.phpmyadmin.net/home_page/downloads.php onto my laptop (192.168.1.119).
2. Used WinSCP to copy the file to my home directory.
3. Logged in and sudo’ed to root.
4. Copied the file from my home directory to /var/www/html, untarred the package, and renamed the directory to phpmyadmin.
5. I then tried to access the server at http://192.168.1.135/phpmyadmin and encountered the following 403 error.

Forbidden
You don’t have permission to access /phpmyadmin on this server.

The error logs (/var/log/httpd/error_log) showed the following:

```
[Thu Apr 19 06:28:22 2012] [error] [client 192.168.1.119] (13)Permission denied: access to /phpmyadmin/ denied
```

I then sought the counsel of Google. Many web sites talk about either permissions on the directory/files or the httpd.conf configuration. My issue was none of those. It had to do with SELinux, which apparently comes built into the minimal CentOS 6.2 install.

```
[root@sandbox conf]# yum list | grep selinux
libselinux.i686 2.0.94-5.2.el6 @anaconda-CentOS-201112130233.i386/6.2
libselinux-utils.i686 2.0.94-5.2.el6 @anaconda-CentOS-201112130233.i386/6.2
selinux-policy.noarch 3.7.19-126.el6 @anaconda-CentOS-201112130233.i386/6.2
selinux-policy-targeted.noarch 3.7.19-126.el6 @anaconda-CentOS-201112130233.i386/6.2
ipa-server-selinux.i686 2.1.3-9.el6 base
libselinux-devel.i686 2.0.94-5.2.el6 base
libselinux-python.i686 2.0.94-5.2.el6 base
libselinux-ruby.i686 2.0.94-5.2.el6 base
libselinux-static.i686 2.0.94-5.2.el6 base
pki-selinux.noarch 9.0.3-21.el6_2 updates
selinux-policy.noarch 3.7.19-126.el6_2.10 updates
selinux-policy-doc.noarch 3.7.19-126.el6_2.10 updates
selinux-policy-minimum.noarch 3.7.19-126.el6_2.10 updates
selinux-policy-mls.noarch 3.7.19-126.el6_2.10 updates
selinux-policy-targeted.noarch 3.7.19-126.el6_2.10 updates
```

The problem was that the phpmyadmin package that I copied via WinSCP ended up with the wrong SELinux context (user_tmp_t), so Apache was not permitted to serve it.

```
[root@sandbox html]# ls -Z
-rw-r--r--. root root unconfined_u:object_r:httpd_sys_content_t:s0 info.php
drwxr-xr-x. root root unconfined_u:object_r:user_tmp_t:s0 phpmyadmin
```

To fix this, I needed to do the following:

```
chcon -R -t httpd_sys_content_t phpmyadmin
```

Note: be sure to use the -R to recursively apply that context against all files. Otherwise you will get a server misconfiguration error.

```
[root@sandbox html]# ls -Z
-rw-r--r--. root root unconfined_u:object_r:httpd_sys_content_t:s0 info.php
drwxr-xr-x. root root unconfined_u:object_r:httpd_sys_content_t:s0 phpmyadmin
```

In retrospect, had I downloaded the file via wget directly into the /var/www/html directory, it would have already taken the proper context, and I would not have had the issue.

Coming Soon

Heeki Park — Mon, 16 Apr 2012 22:06:35 +0000

It’s been a while since posting here but I’ll try to start posting some new info up on this blog. Stay tuned!

Best Practices for Symmetrix Configuration

Heeki Park — Thu, 13 May 2010 15:58:02 +0000

Considerations

Configure enough resources for your workload
Use resources evenly for best overall performance
- Spread across all available components
- Includes FE, BE and disks
- Path management can help FE
- FAST/Optimizer can help BE

Commonly asked questions

What size system do I need?
- Each resource has a limit of I/Os per second and MBs per second
  - Disks
  - Back-end controllers (DAs)
  - Front-end controllers (Fibre, FICON, GigE)
  - SRDF controllers
  - Slices (CPU complexes)
- Configure enough components to support workload peaks
- Use those resources as uniformly as possible
- CPU utilization
  - As a rule of thumb, a limit of no more than 50-70% utilization is good if response time is critical
  - A higher utilization can be tolerated if only IOPS or total throughput matters
- Memory considerations
  - Ideal to have same size memory boards and same memory between engines
  - Imbalance will make little or no difference with OLTP type workloads
  - Imbalance will create more accesses to boards or engines with large amount of memory, creating a skewed distribution over the hardware resources
- Front-end connections
  - Go wide before you go deep
    - Use all 0 ports on director first and then the 1 ports
    - Spread across directors first, then on same director
    - Two active ports on one FA slice do not generally do more I/Os
  - Ratios (random read hit normalized at 1)
    - Random read hit 1
    - Random read miss 1/2
    - Random Overwrite I/O’s 1/2
    - Random new write 1/4
  - Worst connection for a host with 8 connections
    - All on one director
    - Instead do one connection per director
- Disks
  - Performance will scale linearly as you add drives
    - You can see up to 510 IOPS per drive when benchmarking at 8KB, but 150 IOPS is a reasonable design number for real world situations
  - Note that higher IOPS also bring higher response times, as queues will grow
  - Until some back-end director limit is reached
  - With smaller I/O sizes (<32KB), the limit reached is the CPU limit
  - With larger I/O sizes (>32KB), we can reach a throughput limit in the plumbing instead
- Engine Scaling
  - Scales nearly linearly, though not quite.
  - From 1 to 8 engines, it’s 6.8 to 7.8x WRT IOPS (8KB I/O)
  - From 1 to 8 engines, it’s 4.2 to 7.1x WRT bandwidth (64KB I/O)
  - Scaling from 1 to 8 shows worst numbers. 4 to 8 showed better numbers.
What’s the optimum size of a hyper or number per disk?
- General rule of thumb, fewer larger hypers will give better overall system performance.
  - There is a system overhead to manage a logical volume so it makes sense that more logical volumes could lead to more overhead.
- Frequently legacy hyper size is carried forward because of migration
- Virtual Provisioning decouples the size of the hyper on the physical disk from the size of the LUN presented to the host
  - You can create very large hypers for the TDATs and still present small LUNs to the host
- There can be a case of having too few hypers per drive
  - Because it could limit concurrency
  - Set a minimum of 4 to 8 hypers
  - Not an issue with large drives or protections other than R1
What is the optimum queue depth?
- Single threaded (or 1 I/O at a time), the I/O rate is simply the inverse of the service time.
  - For a 5.8ms service time your maximum IOPS is 172.
  - Same drive with 128 I/Os queued can get nearly 500 IOPS
- We need 1-4 I/Os queued to the disk to achieve the maximum throughput with reasonable latencies
  - Lower queue lengths if response time is CRITICAL
- Higher if total IOPS is more important than response time
- With VP, the LUN could be spread over 1000s of drives
  - Queue depth of 32 per VP LUN is probably a reasonable start
- As IOPS go up, response time will exponentially get worse
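
The queue depth trade-off above can be sketched numerically. The first function is just the inverse of the service time from the notes; the second applies Little's Law (outstanding I/Os = IOPS x response time) to show what a deep queue costs in latency. The function names are mine, and the 500 IOPS figure is the rounded "nearly 500" from the notes, so treat the result as approximate.

```python
def single_threaded_iops(service_time_ms):
    """With one I/O at a time, the I/O rate is the inverse of the service time."""
    return 1000.0 / service_time_ms


def avg_response_time_ms(outstanding_ios, iops):
    """Little's Law rearranged: response time = outstanding I/Os / IOPS."""
    return outstanding_ios / iops * 1000.0


# 5.8 ms service time, one I/O at a time: roughly 172 IOPS
print(f"{single_threaded_iops(5.8):.1f} IOPS single threaded")

# 128 I/Os queued at ~500 IOPS: each I/O waits about a quarter of a second
print(f"{avg_response_time_ms(128, 500):.0f} ms average response time")
```

In other words, the deep queue nearly triples the IOPS of the drive but pushes response time from under 6 ms to roughly 256 ms, which is why lower queue lengths are recommended when response time is critical.
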
What is the optimum number of members in a meta volume?
- 255 maximum supported
- Reasonable sizes for meta member counts are something like 4, 8, 16, 32
- Even numbers are preferred
  - Powers of 2 fit nicely into back-end configurations
  - Powers of 2 not important for VP thin metas
- Getting enough I/O into a very large meta can be a problem
  - 32-way R5 7+1 meta volume would need at least 256 I/Os queued to have 1 I/O per physical disk
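
The 256 figure above is simple multiplication, sketched here so other layouts can be checked. It assumes every meta member sits on its own RAID group with no shared spindles; the function name is hypothetical.

```python
def min_queued_ios(meta_members, data_drives, parity_drives):
    """I/Os that must be queued to keep one I/O on every physical disk,
    assuming each meta member lives on a separate RAID group."""
    return meta_members * (data_drives + parity_drives)


print(min_queued_ios(32, 7, 1))   # 32-way meta on R5 7+1
print(min_queued_ios(8, 3, 1))    # a smaller 8-way meta on R5 3+1
```
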
Should I use meta volumes or host-based striping? Or both?
- Avoid too many levels of striping (plaid)
- One large meta volume may outperform several smaller meta volumes that are grouped in a host stripe
- In many cases, host-based striping is preferred over meta volumes
  - One reason is because there will be more host-based queues for concurrency that the host can manage before even getting to the array.
- However, meta volumes can reduce complexity at the host level
- So it all depends
- 24-way meta versus 6 host x 4-way meta – average read response time was better with host-based stripe
Striped or Concatenated Metas?
- In most cases, striped meta volumes will give you better performance than concatenated
  - Because they reside on more spindles
  - Some exceptions exist where concatenated may be better
    - If you don’t have enough drives for all the meta members to be on separate drives (wrapping)
    - If you plan to re-stripe many meta volumes again at the host-level
    - If you are making a very large R5/R6 meta and your workload is largely sequential
  - Concatenated meta volumes can be placed on the same RAID group
  - Don’t place striped meta volumes on the same RAID group (wrapped)
- Virtual Provisioning
  - Back-end is already striped over the virtual provisioning pool so why re-stripe the thin volume (TDEV)
  - May be performance reasons to have a striped meta on VP
  - Device WP “disconnect” between front-end and backend
    - 5874 Q210SR, 5773 future SR fixes this
  - Number of random read requests we can send to a single device
    - Single device can have 8 outstanding reads per slice per device (TDEV on FA slice)
  - Number of outstanding SRDF/S writes per device
    - Single device can have 1 outstanding write per path per device
  - If it is important to be able to expand a meta, choose concatenated
What stripe and I/O size should I choose?
- For most host-based striping, 128KB or 256KB is good
- May want to consider a smaller stripe size for database logs, 64KB or smaller may be advised by a Symmetrix performance guru
- I/O sizes above 64KB or 128KB show little to no performance boost (throughput flattens out). 256KB may actually decrease throughput. This is because everything is managed internally in 64KB chunks.
Segregation
- For the most optimal system performance, you should not segregate applications/BCVs/Clones onto separate physical disks/DAs or engines
- For the most predictable system performance, you should segregate
- Tiers should share DA resources so that one tier will not consume resources for another tier
What disk drive class should I choose?
- EFD provide the best response time and maximum IOPS of all drives
- 15k provide 30% faster performance than 10k (random read miss)
- 15k provide 56% faster than SATA, 10k provide 39% faster than SATA (random read miss)
- SATA still does well in sequential read (with single threaded and larger block sizes) (basically good in single stream, bad with multi-thread and therefore disk seeks)
What RAID protection should I choose?
- Performance of reads similar across all protection types (number of drives is what matters)
- Major difference with random write performance
  - Mirrored: 1 host write = 2 writes
  - R5: 1 host write = 2 reads + 2 writes
  - R6: 1 host write = 3 reads + 3 writes
- Cost is also a factor
  - R5/R6 are best at 12.5% and 25% protection overhead
  - R1 has 50% protection overhead
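
To see what the write penalties above mean for sizing, here is a rough sketch that converts a host workload into back-end disk operations. It assumes the worst case in which every host read is a random read miss and no writes are coalesced in cache, and it reuses the 150 IOPS per drive design number mentioned under Disks. All names are mine.

```python
import math

# Back-end operations per host write, from the list above
BACKEND_OPS_PER_WRITE = {"R1": 2, "R5": 4, "R6": 6}
DESIGN_IOPS_PER_DRIVE = 150   # conservative real-world design number


def backend_iops(read_iops, write_iops, protection):
    """Back-end disk operations per second for a random host workload."""
    return read_iops + write_iops * BACKEND_OPS_PER_WRITE[protection]


def drives_needed(read_iops, write_iops, protection,
                  iops_per_drive=DESIGN_IOPS_PER_DRIVE):
    """Minimum drive count to absorb the back-end load."""
    return math.ceil(backend_iops(read_iops, write_iops, protection)
                     / iops_per_drive)


for protection in ("R1", "R5", "R6"):
    print(protection,
          backend_iops(6000, 3000, protection),
          drives_needed(6000, 3000, protection))
```

For a 6000 read / 3000 write workload, the same host load needs noticeably more spindles on R5 and R6 than on mirrored, which is the performance side of the cost trade-off.
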
How much cache do I need?
- Easiest method is to utilize the Dynamic Cache Partition What If (DCPwi) tool
- Put like devices together in cache partitions
- Start analysis mode and collect DCP stats
How do I know when I’m getting close to limits?
- Watch for growth trends in your workload with SPA
- Look out for increasing response time (host-based tools like iostat, sar, RMF)
- Monitor utilization metrics in WLA/STP
- Better to be proactive than waiting to hit the wall
- Any utilizations well over 50% should be considered a possible source of future issues with growth

Performance as a Function of Utilization on CLARiiON

Heeki Park — Wed, 12 May 2010 23:00:08 +0000

Measurements

Utilization = 100% * busy time in period / (idle + busy) time in period
Throughput = total number of visitors in period / period length in seconds
Average Busy Queue Length (ABQL) = sum of queue lengths upon arrival of each visitor / total number of visitors
Queue length = ABQL * utilization/100%
Response time = queue length / throughput (Little’s Law)
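
The definitions above chain together as follows. This is only an illustration with made-up sample numbers (30 s busy, 30 s idle, 600 arrivals); it is not how Analyzer is implemented, and the function name is hypothetical.

```python
def lun_metrics(busy_s, idle_s, queue_on_arrival):
    """Derive the metrics above from raw counters.

    queue_on_arrival holds the queue length each visitor (I/O) saw on arrival.
    """
    visitors = len(queue_on_arrival)
    if visitors == 0:
        raise ValueError("no I/O arrived during the period")
    period_s = busy_s + idle_s
    utilization = 100.0 * busy_s / period_s             # percent
    throughput = visitors / period_s                    # I/Os per second
    abql = sum(queue_on_arrival) / visitors             # average busy queue length
    queue_length = abql * utilization / 100.0
    response_time_s = queue_length / throughput         # Little's Law
    return utilization, throughput, abql, queue_length, response_time_s


# Made-up sample: half the arrivals see 2 queued I/Os, half see 4
sample = [2] * 300 + [4] * 300
util, tput, abql, qlen, rt = lun_metrics(30, 30, sample)
print(f"util={util:.0f}% throughput={tput:.0f} IOPS "
      f"ABQL={abql:.1f} queue={qlen:.2f} response={rt * 1000:.0f} ms")
```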

For low LUN throughput (<32 IOPS), response time might be inaccurate

Response time here is calculated, lazy writes will skew the LUN busy counter
RBA actually measures the response time

Dual SP ownership of a disk

Can also impact response time
Each SP only knows about its own ABQL, throughput and utilization for the disk
At poll time, they exchange views. The utilization is max(SPA,SPB)
ABQL is computed from the sum of the sum
And SP throughput is the sum of SPA and SPB throughput

Be wary of confusing SP response time in Analyzer with the average response time of all LUNs on that SP

Response time is calculated and based on utilization
A LUN is busy (not resting) as long as something is queued to it
An SP is busy (not resting) as long as it is not in the OS idle loop
While a disk is busy getting a LUN request, the LUN is still busy
While a disk is busy getting a LUN request, the SP might be idle
The SP response time is generally smaller than the average response time of all the LUNs on that SP
Host response time is approximated by LUN response time

Recall from last year:

Rules of Thumb
Multiplier (CPUM)
CX4-960 – 1.00
CX4-480 – 0.65
CX4-240 – 0.55
CX4-120 – 0.30
CX3-80 – 0.50
A – CPUM x 50k reads/s standard lun
B – CPUM x 16k write/s R5
C – CPUM x 20k writes/s R10
D – CPUM x 40k reads/s, Snaps, MV/s, clone source
E – CPUM x 7.5k writes/s MV/s
F – CPUM x 6k writes/s, clone-in-sync
G – CPUM x 2.5k writes/s, Snap COFW
H – CPUM x 6k writes/s, Snap non-COFW
Data logging % = Number of LUNs / Max LUNs * 10%
One SP’s utilization will be the sum of the proportional contributions of each I/O type
Use 4KB for IOPS and 512KB for Bandwidth
I = CPUM x 1500MB/s read
J = CPUM x 600MB/s write (cache on)
Note: ASAP rebuild, background verify, mirror syncs count against this number
Example: CX4-960, RAID 5, 9000 IOPS, 2:1 R:W, 8KB –> 38% utilization
6000 read IOPs, 3000 write IOPs, 48MB/s read, 24MB/s write, RAID 5, CX4-960
6000/50000 + 3000/16000 + 48/1500 + 24/600 = 12% + 18.75% + 3.2% + 4.0% ≈ 38% SP utilization
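
The rule of thumb can be wrapped in a small function to try other models and workloads. This sketch only covers the four terms used in the example (rules A, B or C, I and J), assumes decimal megabytes for the bandwidth terms, and uses names of my own choosing; the multipliers and limits are the ones listed above.

```python
# CPU multipliers (CPUM) from the rules of thumb above
CPUM = {"CX4-960": 1.00, "CX4-480": 0.65, "CX4-240": 0.55,
        "CX4-120": 0.30, "CX3-80": 0.50}

READ_IOPS_LIMIT = 50_000                         # rule A, standard LUN
WRITE_IOPS_LIMIT = {"R5": 16_000, "R10": 20_000} # rules B and C
READ_MBS_LIMIT = 1_500                           # rule I
WRITE_MBS_LIMIT = 600                            # rule J, cache on


def sp_utilization(model, raid, read_iops, write_iops, io_kb):
    """Estimated SP utilization in percent: sum of proportional contributions."""
    m = CPUM[model]
    read_mbs = read_iops * io_kb / 1000.0
    write_mbs = write_iops * io_kb / 1000.0
    fraction = (read_iops / (m * READ_IOPS_LIMIT)
                + write_iops / (m * WRITE_IOPS_LIMIT[raid])
                + read_mbs / (m * READ_MBS_LIMIT)
                + write_mbs / (m * WRITE_MBS_LIMIT))
    return 100.0 * fraction


# The example above: CX4-960, RAID 5, 9000 IOPS at 2:1 R:W, 8KB
print(f"{sp_utilization('CX4-960', 'R5', 6000, 3000, 8):.1f}%")
# Same workload on a CX4-480
print(f"{sp_utilization('CX4-480', 'R5', 6000, 3000, 8):.1f}%")
```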

His formula is low

Configuration polling
- Pre-FLARE 26.31 configuration polling is another low priority internal function that affects utilization
- Go to http://ipaddress/setup
- Select Update Parameters in the Setup Menu and set the Update Interval to 300s.
- Performance Interval (for statistics logging) is ok at 60. This does nothing compared to configuration polling and data logging.
- Also include the -np (no poll) option whenever possible in CLI scripts
Data logging
- 7-10% differential comes from default data logging settings in older FLARE revisions with a lot of LUNs
- Throughput was still unaffected because Analyzer threads run at a lower priority than I/O threads
- Navisphere commands could be sluggish because they would be at the same level
- Fix it by changing from 60/60 or 60/120 to 300/300.
- Data logging poll rate is the lower of the two.
- This will significantly reduce pre-FLARE 29 utilization
Navisphere operations, especially without -np (no poll)
Background verify, rebuild, LUN migration, zeroing operations
Snap, Clone, Mirror, SAN Copy overhead
Disk or bus bottlenecks
Heavy flushing

His formula is too high

Coalesced backend writes
Pre-fetch
Nature of the load

In FLARE 26.31, FLARE 28, FLARE 29, FLARE 30

Delta polling was introduced in FLARE 28 and back-revved to FLARE 26.31
Significantly reduces Navisphere overhead
FLARE 30, CLI commands without -np are given more processor time
FLARE 29, data logging utilization has been reduced 80%
FLARE 30 introduces fully provisioned virtual LUNs in pools of storage (thick LUNs)
H6099 document
NDU now uses % PrivilegedTime not % Processor Time as shown by Analyzer, 65% is safe (instead of 50%).

What will happen with SP utilization in the presence of EMC Flash Cache?

64KB is the base element for analysis for migration into Flash Cache
There are a considerable number of promotions (HDD > EFD) that will cost SP utilization. After the bulk of those initial promotions occur, expect about 8-10% increased SP utilization for Flash Cache after warmup.

VMotion over Distance with VPLEX

Heeki Park — Wed, 12 May 2010 19:57:31 +0000

Vmotion without VPLEX

Cannot directly perform Vmotion since storage is not shared
Must first perform storage Vmotion

Vmotion with VPLEX

Enables direct Vmotion between data centers
Storage Vmotion is no longer required
Replicate the data once then move the VMs at will

Use Cases

Data Center Load Balancing
- Optimize resources across several data centers
Disaster Avoidance and Data Center Maintenance
- Evacuate data center ahead of a probable disaster
- Move application to remote data center to perform maintenance on local data center
Zero-downtime Data Center moves
- Move VMs and data to new data center then decommission old data center

Three Basic Configurations

Common configurations
- Maximum supported distance, 100km (with 5ms latency)
- ESX hosts in both data centers have common IP subnets (stretched layer 2 network)
- ESX servers can participate in local HA and DRS-enabled clusters
VMFS volume built on a VPLEX distributed device
VMFS volume is then shared between ESX servers in two locations
Scenario 1 (distributed device)
- Best practice
- Continuous data protection and transparently protects against storage failures in either location
- Continuous IO on biased cluster after WAN link failure
- Continuous IO on biased cluster after non-biased site failure
- Suspend IO on non-biased cluster after biased site failure
Scenario 2 (built on remote device)
- Not highly available, only good for temporary use when VM must move immediately
Scenario 3 (temporary distributed device)
- Storage Vmotion to a distributed device while in transit to the remote site
- Then Storage Vmotion back to local storage in the remote site
- Do this to regain some array functionality that VPLEX might not have

Failure Cases

N+1 configuration handles director failures transparently
Any WAN or remote cluster failure while Vmotion is in progress simply results in Vmotion being aborted

Rule-set Best Practices

Manage your rule-sets very carefully
- Be aware of which cluster will win in the event of a failure
Place related VMs on the same data store so that they will move together
For any given data store, move all VMs at the same time
For the most critical applications, dedicate a data store to the VM