Al3xandr3

Data Posts

2017-03-12T00:00:00+00:00

July 25, 2013 Product and Application Design
April 10, 2013 Programming
December 5, 2012 Javascript Bookmark Scripts
April 2, 2012 Confluence automation
October 14, 2011 Table.query
May 24, 2011 Dashboarding
April 28, 2011 Who Chats the Most?
December 1, 2010 Homemade Auto-Updater
August 8, 2010 How to download videolectures.net videos with VLC
May 22, 2010 Clojure and Selenium part ii - cov3
April 10, 2010 jQuery Twitter 'mini' plugin
March 8, 2010 Automating todo tasks reports with org-mode
May 15, 2009 Clojure test-is results in twitter
April 24, 2009 Clojure and Selenium
November 25, 2008 Photo organizer
July 31, 2008 Stock Price Alert
June 27, 2008 Ebay Misspells Search

June 3, 2014 Music Recording Cookbook
December 1, 2013 Music Production
May 1, 2013 Music Theory Chords

February 17, 2014 Goals Implementation Plan
February 9, 2014 Game of Life
January 10, 2014 Understanding People
July 26, 2013 Favorite Talks
March 15, 2013 Work, Productivity and Tools
March 11, 2013 Health, Food & Exercise
March 7, 2013 Meta Learning
January 12, 2011 On Tools. Featuring guitar pedals, cattle growing and math
January 21, 2009 Why my keyboard has a QWERTY layout?
February 12, 2008 Funds 'R' US

April 13, 2016 User Retention and the Aha! moment
March 7, 2016 Setting Goals, Planning and Executing
March 2, 2016 User Lifecycle Analytics Framework
March 1, 2016 Testing an Hypothesis
February 15, 2016 Impact Evaluation
November 15, 2015 Data Science in a Business
November 1, 2015 Data in a Business
October 25, 2015 Data Dashboard in Practice
October 14, 2014 Data Tracking and Collection
October 7, 2014 Data Analysis Skill Set
July 30, 2014 Data Analysis Mindset
June 26, 2014 Storytelling and Presenting
June 25, 2014 Arguments, tools and patterns of reasoning
June 24, 2014 Data Analysis Process
April 8, 2014 Data Analysis Techniques
March 2, 2014 Web Analytics Dictionary
June 20, 2013 Data Visualization and Presentation techniques
March 25, 2013 Data Analysis Introduction
November 9, 2012 Data Processing Techniques
January 11, 2012 How to get into the Semantic Web
December 18, 2011 Data, Data, Data!
September 30, 2011 Monitoring Productivity II - the Others
March 20, 2011 Machine Learning Ex5.2 - Regularized Logistic Regression
March 18, 2011 Machine Learning Ex5.1 - Regularized Linear Regression
March 16, 2011 Machine Learning Ex4 - Logistic Regression
March 8, 2011 Machine Learning Ex3 - Multivariate Linear Regression
February 24, 2011 Machine Learning Ex2 - Linear Regression
February 5, 2011 Weight Loss Predictor
October 20, 2010 Monitoring Productivity Experiment
March 16, 2010 AB testing tools in the Future
August 27, 2009 Probability simulation of basketball throws
February 2, 2009 Lem-E-Tweakit and Logic programming
January 18, 2008 Big Brother Google
January 14, 2008 Visualizing Data, with Processing and JRuby

User Retention and the Aha! moment

2016-04-13T00:00:00+00:00

What is the Aha! moment?

It’s the moment users decide they will keep using the product.

The Aha! moment is when users “get” the value proposition that the product offers them.

Finding the exact Aha! event is not trivial, but knowing it is key, and it is tightly linked to the product’s core value proposition.

Examples:
- Twitter: you see the feed of your first contacts and read something interesting
- Uber: after a smooth experience ordering a ride, it arrives at your door
- Instagram: you get positive feedback on your first shared pictures

Timing

When a user decides to take a product for a test drive, the race is on to get that user to the Aha! moment.

The more time a user spends without reaching the Aha! moment, the more likely it is that the user will not come back.

What is your product’s Aha! moment?

Product/Market fit: Satisfy market demand for a solution to a specific problem

The Market: target audience
The Value: what is the user getting out of it? (the value proposition to the user)
The Fit: how well the product delivers its value proposition to the market.

Key: convey your value to your market in the simplest and clearest way possible, in a way that makes sense to users. Failing to convey your solution increases uncertainty, and uncertain users are not retained.

Finding it from data

Look into the data and compare the users who survive (keep using the product) with the ones who don’t…

Which actions did they take, and how did they use the product, when they were new users?

Are there actions that the survivors did often and the non-survivors didn’t? That is the event that gets them to the Aha! moment.

And that is the event you need to get all new users to do.
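The comparison above can be sketched in a few lines of Python. This is only an illustration: the event names, the user ids and the `action_shares` helper are made up, and it assumes you already have early-life events per user plus a set of users who were retained.

```python
def action_shares(events, retained_users):
    """events: iterable of (user_id, action) taken while the user was new.
    retained_users: ids of users who kept using the product.
    Returns {action: (share of retained users, share of churned users)}."""
    users_by_action = {}
    all_users = set()
    for user_id, action in events:
        all_users.add(user_id)
        users_by_action.setdefault(action, set()).add(user_id)
    retained = all_users & set(retained_users)
    churned = all_users - retained
    shares = {}
    for action, users in users_by_action.items():
        r = len(users & retained) / len(retained) if retained else 0.0
        c = len(users & churned) / len(churned) if churned else 0.0
        shares[action] = (r, c)
    return shares

# Hypothetical early-life events for four new users.
events = [
    ("u1", "follow_contact"), ("u1", "read_feed"),
    ("u2", "follow_contact"), ("u2", "read_feed"),
    ("u3", "read_feed"),
    ("u4", "edit_profile"),
]
shares = action_shares(events, retained_users={"u1", "u2"})

# Largest gap between retained and churned users first.
for action, (r, c) in sorted(shares.items(), key=lambda kv: kv[1][0] - kv[1][1], reverse=True):
    print(f"{action}: retained {r:.0%}, churned {c:.0%}")
```

Actions with a large gap between the two shares are candidates for the Aha! event, but the gap is only a correlation, so it still needs to be confirmed with an experiment.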

Metrics

It’s about retention:

They agree to take it for a test drive.
They keep using it long term (survival).

Getting to Aha!

Once you know what your users’ Aha! moment is, create a plan for on-boarding new users in a way that leads them to it.

Try several approaches and optimize the experience using AB testing (always keeping a control group).

Don’t spam users, and don’t keep telling them to do one thing or another all the time. There is a balance and a right timing for putting the right content in front of users. If you push them too hard towards their Aha! moment, you risk getting them to their no-no moment instead.

The most important question to start with, and to really understand: what is your product’s Aha! moment?

Setting Goals, Planning and Executing

2016-03-07T00:00:00+00:00

Setting Goals

The Goals depend on 2 main factors:

The type of business, product, market. (example: e-commerce goals are different than SaaS goals)
The phase (or the goal) the company is in right now. Example: a startup’s goals are different from a mature company’s goals. Do we need more users now, or do we have enough users but need revenue?

Metrics should cover the customer life-cycle stages: look at how a user interacts with the product/service, from finding the product all the way to buying something (depending on the type of business), and create a funnel that looks at each of the stages, typically: acquisition > activation > retention (and churn) > revenue.

Setting Goals

Financially: the goals need to be able to financially support the endeavor.
Find the typical values for businesses with a similar business model, to know whether we are under- or over-performing. For example, it might turn out that the current churn rate is perfectly normal and, more importantly, that we probably won’t be able to move the needle even when investing heavily in it, so it is better to invest resources elsewhere.
Goals also depend on the current company phase: goals in a startup are different from those in a mature company.

Reference: Lean Analytics

Planning

Given the goals, we then need a plan covering the individual steps needed, how others contribute, etc…

Executing

Scrum

OKRs

Objectives and Key Results

Why ?

Focus: What do we do and what do we not do as a company?
Alignment: How do we make sure the entire company focuses on what matters most?
Acceleration: Is your team really reaching its potential?

References:

http://eleganthack.com/the-art-of-the-okr/
http://eleganthack.com/why-use-okrs/

User Lifecycle Analytics Framework

2016-03-02T00:00:00+00:00

The “User Lifecycle Analytics Framework” is a structured way to look at the metrics of a business to help understand its current status, measure evolution, find opportunities, etc…

Different businesses might need somewhat different frameworks, but an important realization is that having a framework is useful: it gives structure and reveals measures to look at and insights that ultimately help the business.

A business will naturally go through different stages and will have different needs depending on the stage it is in. For example, a business might not have enough users and thus need to invest in acquisition, or it might have enough users but not yet realize the revenue.

So in general the business might need to shift focus over time and zoom in on one specific topic at a time, while never losing sight of the business as a whole.

This framework is about the user lifecycle. It considers that within a business the user has a specific lifetime and different phases of maturity: from first discovering the product, to becoming an active user, to using it consistently, to potentially influencing others to join, etc… all the way to no longer using it.

Each business is naturally stronger at specific stages of the user lifecycle and might even want to target specific strengths.

Acquisition

Activation

Retention and Stickiness

Growth

Virality

Net Promoter Score: a tool that can be used to gauge the loyalty of a firm’s customer relationships. It serves as an alternative to traditional customer satisfaction research.

Churn

Churn is a measure of when customers leave, of when users stop using the product or service. A business wants to keep its customers, so it wants to minimize churn. Churn is all about “customer retention”.

A look into a business churn

Churn defined: the definition might vary with the business.
What is the current churn? -> funnel of churn at different points in the user life cycle.
Is the current churn acceptable or really worrying?
- Churn over time
- Competitors’ churn
- Breakdown by country, platform, etc…
How and why do users churn? -> Deep dive: compare non-churning users to churning users
- Number of platforms used
- Do 1st day events have an impact on churn (short/medium/long term)?
- Does regular product feature usage (or lack of it) have an impact on churn?
- Just before churning (compared to a previous period):
  - Is there a change in usage frequency?
  - Which features were used last before churning?
- What time of year do the top churners come from, and what is the hypothesis for why?
- What period in the life of the business were the top churners from, and why?
- Can we predict that a user will churn soon, given their recent usage?
- etc… all kinds of breakdowns we can think of…
(From the “How and why do users churn?” insights) Suggestions on the opportunities for the business to invest in.

Similar structure can be used for other topics, like acquisition, retention, etc…

Avoiding Churn

Understanding why users churn is useful because it can open opportunities to save users from churning. A churn prediction model could tell us, for example, that userA has a 95% probability of churning next week. We could then reach out to the user with an incentive (an email, an offer, a survey) to try to avoid the churn.

Acceptable Churn

We often want churn to be as small as possible, but take into account the common churn for this type of application (from competitors, for example), because from a certain point onwards reducing churn may require an effort that would bring better returns if applied elsewhere.

In an extreme example: the money invested in reducing churn from 3% to 1% might, if invested instead in a new product feature, bring in more users than the number saved from churn.

Revenue

Quality & Performance

Business Scorecard

Once the framework, its definitions and its metrics are agreed upon, we can build a scorecard that includes all of the framework’s metrics, so we have a view of the business end-to-end.

A funnel, or an approximation of it, is a commonly used approach.

This can be generated as a recurring report: week on week, month on month, year on year, etc…

Besides the framework metrics, we also need to include the defined top-level business goal metrics, since they might have been defined outside of the user life-cycle model context.

Measuring activity impact

The scorecard has another application, which is measuring the impact of a product change:

Run an AB test.
Apply the scorecard to the users affected by the change and compare it to the scorecard of the users not affected by the change.
See whether the product change has had a successful impact on the overall business.

Applied in this way, the scorecard should answer 2 questions:

Does the change impact the business goals? (the top-level business goals)
Why? What user behaviors are we changing that contribute to that? -> we can see this by looking at which part of the scorecard (user life-cycle) changed the most.

Metrics: Active Users

A fairly standard metric (industry standard) is the user activity in the application (in-app activity), for example:

DAU, daily active* user
WAU, weekly active user
MAU, monthly active user

*Activity is defined by the user’s explicit interactions with the application, for example opening the application, browsing its content, using its features, etc… The definition is similar to a web user session. Note that notifications that reach the user but are not interacted with should not trigger an AU signal.

The AU metric in the User life cycle:

Acquisition (installs/DAU, registrations/DAU, trials/DAU): a measure of new users, i.e. how much of the DAU comes from registrations, from installs, or from trials…
Retention (or Survival): whether the user is still active in the application after installation, on Day1, Day2, Day3, etc… Week1, W2, W3, etc… Month1, Month2, etc…
Engagement (DAU/MAU): DAU divided by MAU. Ex: 50% means that, on average, a user was active on 15 days of a 30-day month.
Growth (or Virality): K-Factor, = Viral_Installs / Total_Installs, or = Viral_AU / Total_AU
Revenue: Average Revenue per DAU
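
As a small illustration of the Engagement ratio above, here is a sketch that computes average DAU, MAU and DAU/MAU from a raw activity log. The log format and the `active_user_metrics` helper are assumptions made for the example, and it averages DAU over the whole month, which is one of several possible conventions.

```python
from datetime import date

def active_user_metrics(activity, year, month, days_in_month):
    """activity: iterable of (user_id, date) for explicit user interactions.
    Returns (average DAU, MAU, DAU/MAU) for the given month."""
    daily = {}       # day -> set of users active that day
    monthly = set()  # users active at least once in the month
    for user_id, day in activity:
        if day.year == year and day.month == month:
            daily.setdefault(day, set()).add(user_id)
            monthly.add(user_id)
    mau = len(monthly)
    avg_dau = sum(len(users) for users in daily.values()) / days_in_month
    stickiness = avg_dau / mau if mau else 0.0
    return avg_dau, mau, stickiness

# Two hypothetical users, each active on the first 15 days of a 30-day month.
activity = [(u, date(2016, 4, d)) for u in ("u1", "u2") for d in range(1, 16)]
avg_dau, mau, stickiness = active_user_metrics(activity, 2016, 4, days_in_month=30)
print(avg_dau, mau, stickiness)
```

With every user active on half of the days, the ratio comes out at 50%, matching the reading given in the Engagement line.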

Other metrics:

There will be other application specific metrics to use. These are dependent on application type.
Meaning of different metric trends. Ex: MAU is growing but DAU is flat. What does it mean? That there is a lot of churn: many users try the application, but they don’t stay, so they don’t accrue to DAU.

Reference: http://www.slideshare.net/agarimella/social-gaming-metrics

Testing an Hypothesis

2016-03-01T00:00:00+00:00

A collection of notes on running AB tests in practice.

A/B testing and data quality mechanisms

A big part of data quality in A/B testing is the test of statistical significance, but a few other mechanisms are also useful, for example A/A tests.

When should an A/A test be used?

Do an A/A test first in order to test your split testing framework. If the difference between the two is statistically significant at the decided level, your framework is likely broken (keeping in mind that a single test can still be a false positive at that level).
Do an A/A/B or A/A/B/B test and throw out the result if the two A treatments or the two B treatments show different results that are statistically significant at the decided level.
Do many A/A tests (say 100) and if more tests than expected show statistically significant differences, then your framework is broken or the significance calculation is unreliable.
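
To get a feel for the “many A/A tests” check, the sketch below simulates it. It is only an illustration under assumptions: both arms share the same true conversion rate, the split is perfectly random, and significance comes from a two-proportion z-test with a normal approximation. The function names are made up for the example.

```python
import random
from statistics import NormalDist

def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (success_a + success_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (success_a / n_a - success_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_aa_tests(runs, users_per_arm, true_rate, alpha, seed=0):
    """Share of A/A tests that come out 'significant' purely by chance."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(runs):
        a = sum(rng.random() < true_rate for _ in range(users_per_arm))
        b = sum(rng.random() < true_rate for _ in range(users_per_arm))
        if two_proportion_p_value(a, users_per_arm, b, users_per_arm) < alpha:
            significant += 1
    return significant / runs

rate = simulate_aa_tests(runs=200, users_per_arm=1000, true_rate=0.1, alpha=0.05)
print(f"share of A/A tests flagged significant: {rate:.1%}")
```

With a healthy setup the share should sit near the chosen significance level (around 5% here). A real framework that flags far more A/A tests than that deserves a closer look.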

Reference: http://blog.analytics-toolkit.com/2014/aa-aab-aabb-tests-cro

Example: A/A/B Traffic split

Imagine you are planning a test with a new version to try out (by far the most common kind of test), and you decide that a safe bet is to send 1% of traffic to the new experience and increase its volume later, as confidence grows.

When we intend to increase the 1% to 5%, and we started with a 1% vs 99% setup, the increase forces the overall mix to change to 5% and 95%. This changes the volume of 2 groups at the same time… Ideally we want to change as few variables as possible during the test, to be sure the results are influenced only by the variable we intended to test. A better strategy is therefore to start with 1% / 10% / 89% (new experience / control / not tested), because then there is room to grow from 1% to 5% without eating into the control group (the 10%): we take the volume from the 89% that is not being tested. Going to something like 5/10/85, the control group (10) stays unchanged, guaranteeing that its volume hasn’t changed.
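
One possible way to implement such a split is with stable hash buckets, sketched below. The salt, the bucket layout and the function names are assumptions for illustration, not a description of any particular testing tool.

```python
import hashlib

def bucket(user_id, salt="experiment-1"):
    """Map a user to a stable bucket in 0..99 (the salt is a made-up experiment name)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def assign(user_id, treatment_pct):
    """Control is always buckets 0-9; treatment grows downward from bucket 99."""
    b = bucket(user_id)
    if b < 10:
        return "control"
    if b >= 100 - treatment_pct:
        return "treatment"
    return "untested"

users = [f"user-{i}" for i in range(2000)]
at_1_pct = {u: assign(u, 1) for u in users}
at_5_pct = {u: assign(u, 5) for u in users}

# Ramping 1% -> 5% only moves users from "untested" into "treatment".
moved = [u for u in users if at_1_pct[u] != at_5_pct[u]]
print(len(moved), "users changed group, all from untested to treatment")
```

Because the control range never moves, the same users stay in control across the ramp-up, and users already in the 1% treatment stay in treatment.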

Evaluate if the sample is a good representation of the population

Make sure the sampled cohort is a good representation of the whole population and not a biased one.

This is sometimes a subtle fault in tests, and often hard to spot because there is nothing to validate the sample against. If possible, validate the sampling mechanism by first running it against a known population, checking that it works as expected, and only then running it against an unknown population (a simulation of sorts, though not always possible).

For example: let’s say we can only run the test in a particular country. First evaluate whether that country is a fair representative of the rest of the world:

Look up the numbers for the whole world.
Run an experiment where you validate whether the country you are targeting has the same results as the whole world.

If we get the same results, we know we can run a fair experiment in that country and then extrapolate its results to the whole world.

How long to run a test for ?

Power is a measure of how well you can distinguish the difference you are detecting from no difference at all. So running an underpowered test is the equivalent of not being able to make strong declarations of whether your variations are actually winning or losing.

When your test has a low conversion rate for a given sample size, it means that there is not yet enough evidence to conclude that the effect you’re seeing is due to a real difference between the baseline and variation instead of due to chance: in statistical terms, your test is underpowered.
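
A rough way to turn this into a duration is to estimate the required sample size per variation and divide by the daily traffic. The sketch below uses the common normal-approximation formula for comparing two proportions. The baseline rate, the expected rate, the 80% power and the traffic figure are all assumptions to be replaced with your own numbers.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline, expected, alpha=0.05, power=0.8):
    """Approximate users needed per variation to detect baseline -> expected."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline) + expected * (1 - expected)
    return ceil((z_alpha + z_power) ** 2 * variance / (expected - baseline) ** 2)

def days_to_run(per_arm, daily_users_per_arm):
    """Days needed to collect the sample at a steady daily volume."""
    return ceil(per_arm / daily_users_per_arm)

n = sample_size_per_arm(baseline=0.10, expected=0.12)
print(n, "users per arm,", days_to_run(n, daily_users_per_arm=500), "days at 500 users per arm per day")
```

Smaller expected differences need many more users, which is why tests of small changes on low-traffic flows tend to stay underpowered.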

How to check results are trustworthy ?

Statistical significance answers the question, “How likely is it that my test results will say I have a winner when I actually don’t?” Generally we talk about this as 95% statistical significance. A different way to say the same thing is that we will accept a 5% false positive rate, where the result is not real (100% - 5% = 95%).

Reference: https://help.optimizely.com/hc/en-us/articles/200133789

On Experiment Data Granularity

In an AB test, the essential data points are:

startA
successA
startB
successB

Granularity

But then we have the option to look at the data at different granularities:

Event level
Session (or visit) level
User (or visitor) level

A user can go to start many times (like on web pages), and every time an event is recorded. From those we calculate test success as: successA / startA

Which Granularity is best ?

Events: the sum of events alone will be biased, because a user can go to start many times and every time we count startA + 1. This lowers the attempt success rate, although in reality it is still the same attempt. So, no good…

But what about session vs user level?

It depends on the test.

User: when we are OK with allowing the user a longer period to complete the flow, like a week of being exposed to it before eventually succeeding.

ex: a marketing promotion that reaches the user through different channels, continuously for a week, where all of them may contribute a little to nudging the user towards a sale.

Session: when we want to measure the success or failure of a single attempt, it is best to use sessions. Each time you try, we measure success or failure.

ex: attempting to log in

In practice: I’ve found that session level is most often what is needed. When only user-level data is available for an experiment where session level is desirable, group the data by a small time granularity, saying for example that users had 1 day to attempt it.
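
The difference between the three granularities is easy to see on a toy event log. The sketch below assumes each event carries a user id, a session id and a kind (“start” or “success”). The log and the `success_rates` helper are made up for the example.

```python
def success_rates(events):
    """events: list of (user_id, session_id, kind), kind is 'start' or 'success'.
    Returns the success rate at event, session and user level."""
    starts = sum(1 for _, _, kind in events if kind == "start")
    successes = sum(1 for _, _, kind in events if kind == "success")
    sessions, users = {}, {}
    for user_id, session_id, kind in events:
        key = (user_id, session_id)
        # A session (or user) counts as a success if it has at least one success event.
        sessions[key] = sessions.get(key, False) or kind == "success"
        users[user_id] = users.get(user_id, False) or kind == "success"
    return {
        "event": successes / starts,
        "session": sum(sessions.values()) / len(sessions),
        "user": sum(users.values()) / len(users),
    }

events = [
    # u1 reloads the start page three times in one session, then succeeds
    ("u1", "s1", "start"), ("u1", "s1", "start"), ("u1", "s1", "start"), ("u1", "s1", "success"),
    # u2 fails in a first session and succeeds in a second one
    ("u2", "s1", "start"),
    ("u2", "s2", "start"), ("u2", "s2", "success"),
]
rates = success_rates(events)
print(rates)
```

The repeated starts drag the event-level rate down, the session-level rate treats each visit as one attempt, and the user-level rate only asks whether each user eventually succeeded.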

Experimentation Caveats

Experimentation won’t do magic: it helps you improve in the direction you set, but it won’t reveal new directions by itself. Plan up front what the learning will be if the test is successful.

Not all experiments will result in a success, and the improvements in the business KPIs may not be very big. But not all is lost:

For the unsuccessful experiments, try to understand why, and keep it as a learning for future tests.
A small relative increase can still be a big win: for example, a 3% increase in a business KPI can translate into several million users.
The bigger the business, the smaller the opportunity for improvement may be, and thus the smaller the potential returns of the experiments.
Bigger returns are likely to happen only in more radical tests. Ex: changing the color of a button has less potential return than changing the whole user flow.

Some experiments might hypothesize that the impact is only visible in the long term. In that case run the experiment, turn it off, and look at the affected cohort of users after 3/6 months.

Impact Evaluation

2016-02-15T00:00:00+00:00

Impact Evaluation

Impact evaluation is structured to answer the question: “how would outcomes such as the participants well-being have changed if the intervention had not been undertaken?” This involves counter-factual analysis, that is, “a comparison between what actually happened and what would have happened in the absence of the intervention.”

A secondary question that normally comes immediately after is “why?”. What behaviours did we change that explain the observed impact of the intervention?

The key challenge in impact evaluation is that the counter-factual cannot be directly observed and must be approximated with reference to a comparison group. There is a range of accepted approaches to determining an appropriate comparison group for counter-factual analysis, using either:

Prospective (ex ante), selection of treatment vs control
Retrospective (ex post) evaluation design.

5 things that can contaminate the measured impact: confounding, selection bias, spillover, contamination, impact heterogeneity.

(ex ante) Randomized field experiments are the strongest research designs for assessing program impact. This design is generally said to be the design of choice when feasible, as it allows for a fair and accurate estimate of the program’s actual effects.

Non-experimental design

Non-experimental designs are the weakest evaluation design, because to show a causal relationship between intervention and outcomes convincingly, the evaluation must demonstrate that any likely alternate explanations for the outcomes are irrelevant.

However, there remain applications to which this design is relevant, for example:

When experimental evaluation takes too much time.
When experimentation is not possible.

On product launch

How do we then quantify the final real impact?

In general we can look at the top-level KPIs and see whether their trends change, to get an idea of whether the launch really changed things or not.

But to quantify it somewhat, one approach is to compare the time periods before and after deployment and see what the difference is.

The caveat with this approach is that other external confounding factors can influence it, like seasonality, holidays, weekends vs weekdays, other deployments happening at the same time, etc… So this approach has to be carefully planned and, in general, try to find ways to exclude as many potentially influencing external variables as possible.

When results show a small difference we are less confident in them, because the external influences could be biasing them somewhat in either direction.

Methods

Quasi-experimental (non-random) methods can be used to construct controls when it is not possible to obtain treatment and comparison groups through experimental design. With constructed controls, individuals to whom the intervention is applied (the treatment group) are matched with an “equivalent” group from which the intervention is withheld and the average value of the outcome indicator for the target population is compared with the average of that for the constructed control.
Another nonrandom method of obtaining a control involves reflexive comparisons, in which participants who receive the intervention are compared to themselves before and after receiving it. -> Beware of selection bias.

Estimation methods

Comparison of means: an estimation method to be used with experimental design.
Multi-variate regression analysis: an estimation method to be used with non-experimental design.
Instrumental variable method: an estimation method to be used with non-experimental design. It is used in statistical analysis to control for selection bias due to unobservables. The “instrumental variables” are first used to predict program participation; then one sees how the outcome indicator varies with the predicted values. Often, one can use geographic variation in program availability and program characteristics as instruments, especially when endogenous program placement seems to be a source of bias. For example, a researcher may attempt to estimate the causal effect of smoking on health from observational data by using the tax rate for tobacco products as an instrument for smoking. The tax rate is a reasonable choice of instrument because the researcher assumes that it can only be correlated with health through its effect on smoking. See https://en.wikipedia.org/wiki/Instrumental_variable and https://www.quora.com/What-are-some-examples-of-really-clever-instrumental-variables-approaches-in-econometrics
Double difference or difference-in-differences methods: an estimation method to be used with both experimental and non-experimental design.
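
As a worked example of the last method, difference-in-differences subtracts the change seen in the comparison group from the change seen in the treated group, so that trends shared by both groups (seasonality, for example) cancel out. The numbers below are invented, and the sketch relies on the usual assumption that both groups would have followed parallel trends without the intervention.

```python
from statistics import fmean

def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """(change in treated group) minus (change in control group), using group means."""
    treated_change = fmean(treated_after) - fmean(treated_before)
    control_change = fmean(control_after) - fmean(control_before)
    return treated_change - control_change

# Hypothetical KPI values per user, before and after a launch.
effect = diff_in_diff(
    treated_before=[10, 12], treated_after=[15, 17],   # mean 11 -> 16, change +5
    control_before=[9, 11], control_after=[11, 13],    # mean 10 -> 12, change +2
)
print(effect)
```

A plain before/after comparison would credit the launch with the full +5, while the difference-in-differences estimate attributes only the part not explained by the control group’s +2.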

References

https://en.m.wikipedia.org/wiki/Impact_evaluation
http://web.worldbank.org/WBSITE/EXTERNAL/TOPICS/EXTPOVERTY/EXTISPMA/0,,contentMDK:20188244~menuPK:384339~pagePK:148956~piPK:216618~theSitePK:384329,00.html
https://en.m.wikipedia.org/wiki/Economic_impact_analysis
https://en.wikipedia.org/wiki/Instrumental_variable

Data Science in a Business

2015-11-15T00:00:00+00:00

Data Science Recipe

A quote from http://blog.kik.com/insights-from-kik-team-bot-at-app-annies-decode-2015

What goes on behind the scenes? We use (1) a variety of tools to normalize data and then (2) cluster results depending on whether people are power or passive users. We then (3) build multiple statistical models concentrated on the area's most likely to drive growth or retention, which generally uncover our best user's patterns and behaviors. When we (4) find attributes that pop (and sometimes its hard to say if certain sets of behaviors are causal or just correlated), we (5) run multiple tests with four or five permutations to determine the most effective program. By stacking a bunch of these programs on top of each other, we can propel growth for Kik.

I like this quote because it is, in essence, a recipe for applying data science methods to a business in a very practical way.

1. Normalizing

(Statistical) Normalizing is about adjusting different measures, or even different data sources into a common alignment that is then fair to compare and join together.

For example: Normalization of ratings means adjusting values measured on different scales to a notionally common scale, often prior to averaging.
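
A minimal sketch of the ratings example, using min-max scaling as one of several possible normalization methods (the scales and values are made up):

```python
def min_max(value, low, high):
    """Rescale a value from the range [low, high] to [0, 1]."""
    return (value - low) / (high - low)

stars = min_max(4, low=1, high=5)      # 4 stars on a 1-5 scale
score = min_max(75, low=0, high=100)   # 75 points on a 0-100 scale
combined = (stars + score) / 2         # now fair to average
print(stars, score, combined)
```

Both ratings land on the same point of the common scale, which would not be visible when comparing the raw 4 and 75.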

2. Create classifications

Given the business goals, identify the users that are already ideal cases versus the ones who are not. For example, if the company goal is to have highly engaged active users, identify the power users versus the rest.

Augment the existing user data with that “classification” information; this creates the targets to build statistical models on.

As time goes by and the first classifications and learnings are collected, they feed back into the next set of classifications to create. So over time the classifications will tend to become more sophisticated and closer to real user behaviors.

3. Model

Modeling is about finding a formula that, given a set of input variables, outputs a classification.

An example: power users (classification) are the users who click a particular button > 25 times (input variable) on workdays (input variable).

So modeling essentially scans all data* and finds that ideal users are the ones who often do XYZ (indicating a strong correlation). This modeling process is applied to the classifications created before.

*Sometimes the data that best correlates with the classifications might not be available; this requires either new telemetry or joining in new data sets.

The purpose of creating a model is to reveal which behaviors are associated with the ideal users, and that essentially tackles the question: “What is driving the business goals right now?”

4. Hypothesis

Once we know what is driving the Biz goals, we then tackle “Where are the opportunities for growth?”:

What can we change/introduce in the product that will drive that further?
How can we get the non-power users to become power users?
Or it turns out that users are actually using the product differently from what we imagined. Maybe we need to change the product strategy (pivot the biz).

This step looks into the learnings and outputs a collection of “what if we do this next?”

5. Experiment

Because correlation does not (always) imply causation, we then need to run experiments (A/B tests) that measure the impact of the hypotheses created before. Accelerate the testing by using multivariate A/B tests, where multiple features get tested at the same time in a single experiment.

Data in a Business

2015-11-01T00:00:00+00:00

The Essential Questions

What is the business about?

Understand what its core mission is and what value it offers to users.

Also, who are the competitors, how are they doing, and how is this business better than the competition?

What data is required?

Define the infrastructure necessary to measure the business:

Choose and define the metrics that allow us to measure the business (the KPIs).
Create / Define the frameworks and standards, for:
- Measuring (telemetry guidelines): What are the events we need to capture? And how? Document it.
- Collecting and processing data: What aggregations need to be done? Document it.
- Testing hypotheses of the impact of new activities (AB testing, pre/post launch analysis)
- Standard data definitions and methods (Churn, Survival, ARPU, Stickiness)
Reporting Infrastructure.

How is the business doing?

Answer this by understanding the top-level reality of the business. This is closely tied to the previous step, but also involves:

Know what the business goals are. Be part of the discussion on defining the business goals if needed. These are key: this is how we evaluate the health of the business, each business activity and even each individual contribution.
Set up, analyze, digest and frequently share the top metrics.

The Essential Questions, part 2: day-to-day activities

Once it is clear what the business is about, how it is doing and where it should be going, there is the need to accompany the business in its daily activities.

What should the business be doing?

And, equally important, what it should not be doing. This includes:

Understand what drives the current KPI volumes. How can we get more of that? What opportunities are we missing? What kinds of activities do not drive the KPIs further?
Advocate working on the right things, i.e. the activities that potentially contribute the most to the set goals, and avoid wasting resources on the activities that don’t contribute as much.
- Guide teams in creating the ROI hypothesis for each planned activity (guessing the ROI of a hypothesis is hard, but it should leverage the learnings from previous activities).
To be able to evaluate the activities fairly and scientifically, be a constant data advocate and educate the business on measuring and being data driven. For example, argue that every activity should include an AB test.

Did the activity help?

For each activity answer: did it (the activity) help achieve the business goals? And by how much?

This is a quantification exercise.

Here the activity owners often ask the analytics team for the results:
- The resulting analysis should be tackled in a structured and consistent way across every activity (as defined in “What data is required?”); this allows activities to be compared directly.
- Activity owners might not be asking for the metrics that best measure the business impact, so advocate for the right metrics to use.
- Naturally, activity owners are highly biased about their activity’s success and might not acknowledge bad results. But data analysis ethics require the truth.
“And by how much?” is essentially the contribution of the individual activity towards the end goal. It is often hard to measure, but key even if just an estimate.
When an activity is not helping (towards the goal), data analysts are responsible for raising it with adequate communication, to allow for timely course correction.

What did we learn?

Once we have the numbers on how much the activity helped, form an opinion on what worked, what didn’t, and why, and create hypotheses for the next activities.

When an activity is unsuccessful, ask why: was the problem in the hypothesis or in the execution?
Build and curate the history of what worked and what didn’t, as it helps crystallize the mental model for ROI estimations of future activities.
Communicate the learnings to the rest of the organization, so that the whole organization becomes smarter.
- Communicate using best practices: tell stories instead of dumping numbers, and be straight to the point (see the Minto Pyramid).

Data Dashboard in Practice

2015-10-25T00:00:00+00:00

Motivation

Approach dashboard building in a structured way, so that building new dashboards is just following a recipe and it can thus scale up well.

Unpivot Structure

At its base is the idea of unpivoting a table with several columns into a table with 3 fixed columns that can hold any metric.

The Idea

Transform:

date	users	clicks	money	ranking
1	50	100	12EUR	1

into:

date	metric	metric_value
1	users	50
1	clicks	100
1	money	12EUR
1	ranking	1

This approach forces a fixed column structure and thus it becomes consistent to manipulate and easy to add new metrics.

To add interaction, use for example an Excel slicer on the metric, and plot metric_value by date.

We end up with a report interface that shows many views in a compact way. For example, 1 chart and 1 table plus slicers to switch between dimensions can show a lot of data in a small space.

This approach also makes it easier to add dimensions to a report without having to change the report format considerably.

In reality we might want to add more columns as extra useful dimensions, like country, type, etc…

Once you keep using this structure as a base, a whole lot of methods developed for one report can be reused for others.

De-duplication Problem

When including several breakdowns like country, be aware that some metrics can’t be summed up to get a total, for example users per country. Imagine that the same user has been in 2 countries in the same month: in the summed total that user would show up 2 times, inflating the number artificially.

To avoid this duplication problem, calculate the de-duplicated number upfront (in SQL) and include it in the data extract before it goes into the dashboards.

Example:

Imagine that in the same month I’ve traveled between Estonia and Portugal; the data extract should then look like:

date	metric	metric_value	country
month1	users	1	Portugal
month1	users	1	Estonia
month1	users	1	All
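
The same extract can be sketched in pandas instead of SQL. This is only an illustration: the `visits` table and its `user_id` column are hypothetical, and the key point is that the "All" row is counted from the raw rows rather than summed from the per-country rows.

```python
import pandas as pd

# Hypothetical raw data: one row per (user, month, country) visit.
visits = pd.DataFrame(
    {
        "user_id": ["u1", "u1"],
        "date": ["month1", "month1"],
        "country": ["Portugal", "Estonia"],
    }
)

# Distinct users per country.
per_country = (
    visits.groupby(["date", "country"])["user_id"]
    .nunique()
    .reset_index(name="metric_value")
)

# Distinct users across all countries, counted from the raw rows
# (not by summing the per-country numbers, which would give 2).
overall = visits.groupby("date")["user_id"].nunique().reset_index(name="metric_value")
overall["country"] = "All"

extract = pd.concat([per_country, overall], ignore_index=True)
extract["metric"] = "users"
extract = extract[["date", "metric", "metric_value", "country"]]
print(extract)
```

Summing the per-country rows would report 2 users, while the precomputed "All" row keeps the correct total of 1.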

Data Querying (SQL)

Use WITH statements in the SQL query to create cleaner, more reusable queries. They make it easier to see (and understand) which sections to replace to get a new data point. A query can be structured like: “WITH filter1 AS (…), filter2 AS (…) SELECT …”
It is better to run a separate SQL query for each dimension and join the results afterwards, instead of trying to write 1 massive query that collects everything but makes it hard to add or remove things (often it is not even possible to run that massive query in 1 go).
Do a union of the queries at the end to bring all the data together; each query should output the same columns.
Crystallize queries into stored procedures or database UDFs only when the report is in its final version; doing it earlier adds a big overhead to the development process.
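
As a minimal sketch of the WITH plus union pattern, the snippet below runs against an in-memory SQLite database from Python. The `events` table, its columns, and the sample rows are made up for illustration; a real warehouse would use its own schema and SQL dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (day TEXT, user_id TEXT, kind TEXT);
    INSERT INTO events VALUES
        ('2015-10-01', 'u1', 'click'),
        ('2015-10-01', 'u2', 'click'),
        ('2015-10-01', 'u1', 'click'),
        ('2015-10-02', 'u1', 'click');
    """
)

# Each WITH block is a small, replaceable section; every metric query
# outputs the same 3 columns so the results can be unioned at the end.
QUERY = """
WITH clicks AS (
    SELECT day, user_id FROM events WHERE kind = 'click'
),
users_metric AS (
    SELECT day AS date, 'users' AS metric, COUNT(DISTINCT user_id) AS metric_value
    FROM clicks GROUP BY day
),
clicks_metric AS (
    SELECT day AS date, 'clicks' AS metric, COUNT(*) AS metric_value
    FROM clicks GROUP BY day
)
SELECT date, metric, metric_value FROM users_metric
UNION ALL
SELECT date, metric, metric_value FROM clicks_metric
ORDER BY date, metric
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Adding a new metric then means adding one more WITH block and one more UNION ALL branch, without touching the existing ones.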

The Date Dimension

Pull in an external table with all the date calculations needed and join on the day, for example (or on some other attribute that makes sense). There is no need to keep rebuilding date functions for each data set; just make sure the data can be joined on some attribute.

For example a calculated date dimension: http://www.ipcdesigns.com/dim_date/
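
A small sketch of the idea in pandas, assuming a hand-built date dimension rather than the one linked above. The column names (`day`, `weekday`, `is_weekend`) and the sample data are illustrative choices.

```python
import pandas as pd

# Hypothetical date dimension, built once and reused across data sets.
days = pd.date_range("2015-10-01", "2015-10-07", freq="D")
dim_date = pd.DataFrame(
    {
        "day": days,
        "year": days.year,
        "month": days.month,
        "weekday": days.day_name(),
        "is_weekend": days.dayofweek >= 5,
    }
)

# Hypothetical fact data: it only needs a joinable `day` column.
facts = pd.DataFrame(
    {"day": pd.to_datetime(["2015-10-01", "2015-10-05"]), "clicks": [100, 80]}
)

# The join brings in all precomputed date attributes at once.
report = facts.merge(dim_date, on="day", how="left")
print(report)
```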

Doing the Unpivoting

In Excel: PowerQuery -> Unpivot

In Python: pandas.melt
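
A minimal pandas sketch, using the example table from “The Idea” section; the variable names are arbitrary.

```python
import pandas as pd

# Wide table: one column per metric.
wide = pd.DataFrame(
    {"date": [1], "users": [50], "clicks": [100], "money": ["12EUR"], "ranking": [1]}
)

# Unpivot: keep `date` as the identifier and turn every other column into a row.
long = wide.melt(id_vars=["date"], var_name="metric", value_name="metric_value")
print(long)
```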

MoM calculations

“Month on Month Growth” in power pivot DAX language:

Last Month:=CALCULATE(SUM('MYDATA'[metric_value]))

Prev. Month:=CALCULATE(SUM('MYDATA'[metric_value]),DATEADD('MYDATA'[Month],-1,MONTH))

MoM Change:=IF([Prev. Month],([Last Month]-[Prev. Month])/[Prev. Month],BLANK())
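
For comparison, the same month-on-month calculation can be sketched in pandas on the unpivoted format. The sample numbers are made up, and the column names `month`, `prev_month` and `mom_change` are illustrative.

```python
import pandas as pd

# Hypothetical monthly extract in the unpivoted format.
data = pd.DataFrame(
    {
        "month": ["2015-01", "2015-02", "2015-03"] * 2,
        "metric": ["users"] * 3 + ["clicks"] * 3,
        "metric_value": [100, 110, 99, 1000, 1500, 1500],
    }
)

data = data.sort_values(["metric", "month"])

# Previous month's value within each metric; the first month has none (NaN).
data["prev_month"] = data.groupby("metric")["metric_value"].shift(1)
data["mom_change"] = (data["metric_value"] - data["prev_month"]) / data["prev_month"]
print(data)
```

Unlike the DAX version, this sketch does not guard against a previous month of zero, which would need separate handling.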

Assorted Tips

ETL often getting in the way: if the ETL is slow and often fails, run the report with a day of delay, that is, depend on yesterday’s ETL, not today’s. Yesterday’s run will almost always be done.
Visualizations, chart axis order: if for some reason the visualization tool does not support custom ordering, add numbering to the metric names. That will force the right order.
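
The numbering tip can be sketched as follows, assuming the tool sorts labels alphabetically. The metric names are hypothetical funnel steps; zero-padding keeps the order right once there are more than 9 items.

```python
# Desired display order for the metrics (hypothetical funnel steps).
ordered_metrics = ["visits", "signups", "purchases"]

# Prefix each name with a zero-padded position so that alphabetical
# sorting matches the intended order.
width = len(str(len(ordered_metrics)))
labels = {
    name: f"{position:0{width}d}. {name}"
    for position, name in enumerate(ordered_metrics, start=1)
}

print(sorted(labels.values()))
```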

Data Tracking and Collection

2014-10-14T00:00:00+00:00

Motivation

To understand what is going on with a product/organization/strategy we need to place probes in specific points of its structure to be able to measure them in detail. Often there are several data points generating data, that need to be reconciled, cleaned up, and aggregated into a consumable format.

Therefore, each organization will have the need to:

Identify its critical points to measure. (KPIs)
Put in place the telemetry that can measure and capture data.
Reconcile and aggregate all the data points into a consumable format.

Once the data from several sources is in a consumable format, analysts can start querying it and revealing insights about reality that better inform the organization’s future strategies/actions.

Data repositories

What is common in all organizations by default

A production database that contains data about the users and their transactions with the product - optimized for high availability and speed of end-user-facing systems, and thus not optimized for analysis work. It is even risky to open it up for analysis, as one mistake could end up impacting users.
Logs that most applications (e.g. web servers) generate when they are in use - these were made for keeping a record of the application execution; they exist in production environments and are not easily or directly consumable by analysts.

Data Analysis repositories

For the purposes of analysis, the data from these production systems is then collected, cleaned and aggregated into a separate data repository optimized for analysis work. It can generate reports automatically and makes data easier to navigate, i.e. it holds simplified and aggregated data that is more easily consumable.

Dedicated Analytics Telemetry & Collection

Increasingly common is to also have a dedicated telemetry and collection system that is placed explicitly in the “critical” points we need and that captures the exact data we define. This allows for more specific measurement, and makes available data that does not exist in the default systems (or is too cumbersome to use).

It requires somewhat more effort from the organization: adding the telemetry to the product (at the critical points), and having extra mechanisms and support to handle the telemetry and data collection systems.

Dedicated vs default, what to choose?

What is required depends on the data needs. Default sources are typically always there, and make sense when they are within easy reach and in a consumable format. When data is needed that does not exist in the default systems, the dedicated way is required.

UI click-stream data is “the” typical dataset that requires a dedicated telemetry system.

Data structures

The industry standard for data representation is a table, in .csv or very often a relational database table. Recent formats also include key->value paradigm (JSON, Hadoop). For consumable data, the relational table format seems to still persist as the most successful one (hint: Hive on top of Hadoop).

Realtime vs non-Realtime

Looking into collected data often branches into 2 different needs:

Realtime: used for monitoring systems, to identify as early as possible if a server or a service is having problems that could be impacting end users. - Mostly fairly raw data “health” signals.
Non-realtime: where analysts go to look for insights that feed into strategic product decisions. - Often using more complex data aggregations.

Telemetry principles

Some care should go into adding Telemetry.

In general, an ideal approach to telemetry is to instrument the APIs that get re-used widely across the organization, instead of each of the end products individually. - Realistically, very product-specific details are often needed, and these require “manual” telemetry.
Telemetry APIs should be simple and obvious to use: telemetry is a secondary priority compared to the main product development and is often done in a hurry, so more complexity means a higher probability of things going wrong.
The data captured in telemetry needs to be properly planned and validated against the required KPIs - you don’t want to realize after 2 months of live data that you are not capturing all the data needed to meet the requirements.
- Telemetry should be defined very simply: design for simplicity and capture the obvious things in a simple way - any requirement/KPI should need only 1 filter to surface later. Complex, big, tricky filters will more often make the data unreliable.
- Write down a telemetry requirements specification. Map the telemetry to reality very clearly, i.e. include screenshots of the tracking placements, a diagram of the sequence of actions/user flow, and the exact tracking values placed on them. This will help later with the analysis work, and at the same time makes things very clear for developers.
- Set up a report in the QA phase to make sure the telemetry will capture the requirements.
- The telemetry spec should also be handed over to the developer QA team for validation.

Data collection types and caveats

Logs

Caveats

Bots will influence the traffic captured in logs. Bots typically don’t run JavaScript, so JavaScript telemetry does not suffer from this problem.
Caching will also influence logs. Cached content is served back to the user without hitting the web server, and thus never reaches the logs. JavaScript telemetry helps here, as it is always executed even with cached content.

Tools

https://www.youtube.com/watch?v=Kqs7UcCJquM - Elasticsearch, Kibana, Logstash
http://docs.fluentd.org/articles/free-alternative-to-splunk-by-fluentd
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Logging_Solutions_Recommendation
http://codeascraft.com/2011/02/15/measure-anything-measure-everything/
http://www.fluentd.org/blog/unified-logging-layer

URL get/post (includes Javascript telemetry)

Here I include JavaScript and other telemetry mechanisms that work by making a GET/POST request to a URL with the data as parameters. A very common one is the JavaScript used in web analytics.
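
As a rough sketch of the mechanism, the snippet below only builds the URL a client would request for one event. The endpoint and the parameter names are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical collection endpoint; not a real service.
COLLECT_URL = "https://telemetry.example.com/collect"


def tracking_url(event: str, **params: str) -> str:
    """Build the GET URL a client would request to report one event."""
    query = urlencode({"event": event, **params})
    return f"{COLLECT_URL}?{query}"


url = tracking_url("button_click", page="/pricing", user_id="u1")
print(url)

# The collecting server reads the same data back from the query string.
received = parse_qs(urlsplit(url).query)
print(received)
```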

Caveats

Requests can be blocked by ad blockers, firewalls, and 1st vs 3rd party cookie browser policies (up to 20% loss?), and HTTP requests do not have guaranteed delivery: all of these mean that some requests are lost with JavaScript tracking.

Tools

Google Analytics

Big Data

Google, needing to handle massive amounts of data for its search engine, created a tool based on a distributed file system that runs parallel MapReduce jobs. Hadoop is the implementation of the same approach outside of Google.

More recently Google created a 2nd tool called Dremel, which is faster and better suited for exploratory data analysis (EDA). It supports SQL natively. It also exists as an online service from Google, called BigQuery.

Reference

Data Tools Ecosystem: http://insightdataengineering.com/blog/new-ecosystem/

Data Analysis Skill Set

2014-10-07T00:00:00+00:00

Values

These might vary with company culture, but are fairly common:

Company Culture

Customer Focus - In a product that provides a service, it’s all about the end customer: you need to be helping the end customer in a useful way, and not hurt the view that the customer has of you.
Take Ownership - It’s your baby, everything about it is on you; make sure the project goes all the way through to completion.
Sense of Urgency - Move in an agile way, plan, monitor, re-think, but move onwards at a good pace.
Highest Standards - Produce work you are proud of.
Humility - Everybody makes mistakes, and everybody has different views that can grow an idea into something bigger; listen to others.
Exposure points - Think about which parts of your work get exposed to others; that helps prioritize what is important and what really makes a contribution.

Data Analyst

What is the truth? - The job is to reveal the truth. It’s OK to have an opinion, but be on the lookout for biases influencing you (your own or others’).
Macro picture thinking - The analyst needs to constantly ask “so what?”: what does this really mean for the bottom line?

Data Analyst ‘Hard’ Skills

Data Collection and Manipulation

Basic ETL developer: data collecting, scraping from online sources, augment collected data with useful dimensions (often the time dimension).
Telemetry: define adequate telemetry to tackle a problem, avoid over-complicating, but guarantee we will get all data needed for project.
Data manipulation: table data (pivot, unpivot), csv, Json, XML.
Basic understanding of dimensions and facts in a data warehouse and data-stores in general.
Relational databases and SQL querying.
Big data tools: hadoop, hive.
Python

Data Analysis Methods

Data QA: validating data to make sure it is correct, and trying to understand the data (http://bit.ly/1CWmrNc).
Back of the envelope calculations.
Define KPIs: understanding what is important, the organization context and domain, understanding the data being captured (how it is captured, what limitations it has), surfacing which metrics most closely relate to the data analysis challenge at hand.
Understanding A/B tests and experiments: understanding the process of how to test: sample size calculation, calculating confidence in results.
Understanding clickstream data (and web analytics data as a subtype of clickstream): users, sessions and hits and their scope. Understanding why monthly unique users is not the same as the sum of 4 weeks of unique users.

Visualization: understanding what charts to use, both when analyzing and presenting/sharing data.
Regular Reports: what to include, what not to include, how to best display findings in a fair summarized format, how to automate regular reports ( or minimize update effort ).
Presenting / Sharing Data: putting a slide deck together that tells the story of the findings, adequate summaries and adequate charts.
Infographics
Excel
Understanding of chart types: time series, funnels, maps, heatmaps.
Interactive, creative charts: D3.js, Tableau, MS Power BI stack

Modeling / Statistics

Statistical Methods: median, correlation, what methods to use in which situations.
Prediction methods: regression, moving averages.
Data modeling (and Machine learning): when to use, what they are good for, how to use.
Hypothesis testing: p-value, Chi-Squared Test, T-Test
R

Data Analyst ‘Soft’ Skills

Analytics Skills

Understand the impact that the analysis results will have.
- Understand well the problem statement.
- Understand the context / domain / business.
Tackle an analysis problem in a structured way, understanding dependencies and limitations upfront.
Include reliable QA methods in the analysis process.
Seek the unbiased truth. Reveal the reality, shielded from personal or stakeholder biases.
Understand and be skilled with the analysis toolkit.
- Query and manipulate data.
- Know how and when to apply adequate statistical methods.
- Know what is best chart to use for each type of data.
- Good understanding of tooling available and how to use it.
Clear communications: make sure the results are well communicated:
- In document form: clearly written, easy to understand for the intended audience, adequate content, adequate charts, not sloppy, etc…
- In presentation form: compelling presentations for any level audience.
Take ownership of the project with a proactive attitude: drive for resolution, analyze impact, proactively alert those who need to be aware of impacts, and make sure results are consumable by those who need them (and on time).
Understand, follow and advocate the data privacy policies.
Know how to prepare adequate telemetry that collects the results that answer the project’s needs.

Effective Work

Can understand the problem and priorities, and is self-sufficient enough to make a plan and set priorities that are optimal to drive the project; understands which projects/tasks are more important and why. Example: how big an impact it can create, whether it is running in production now and could be impacting users (urgent), etc…
Identifies and highlights risks early in the project, and plans for them.
Understands how current work impacts the overall org./biz.
Can influence/drive progress by articulating the value of the project. Project advocate.
Can give time estimates and can articulate areas of risk, dependency and confidence in relation to ETAs.
Tracks and documents project progress on a regular basis.

Communication

Understand the request clearly: actively investigate the numbers and probe all parties involved further, to make sure the request, task and assumptions are clear.
- Summarize and translate a problem statement into a few bullet points and confirm with stakeholders that it is correct - this helps confirm understanding and agree on a clear path to tackle it.
- Remember: assumption is the mother of all mess-ups. Ask, and re-ask, to confirm assumptions.
Agree on the task requirements with stakeholders early in the project.
- Set expectations.
Keep communicating regularly as the project develops.
- Keep track of, and agree with stakeholders on, goals, objectives, next steps, milestones, etc…, and make them transparent.
Create good written summaries of the results, including explaining complex data well and choosing adequate charts and explanation methods for it.
Masterful storytelling: clear, contextualized and engaging outcome-oriented storytelling (written and in presentations).
- Can influence and communicate convincingly in speech.
- Can articulate complex technical details well.

Stakeholder Partnership

Well-defined responsibilities and accountabilities. What is expected.
- Share with stakeholders the analytics plan and get their approval.
- Deliver on plan.
Manage expectations by communicating well and checking understanding.
Flag up dependencies and risks, and offer solutions (next steps) to problems.
Look at the stakeholder problem/project from the ground up (extensively) and suggest the KPIs, reports and frequency that best tackle the problem.
- Ideally the analyst needs to be included in the planning discussions (and in defining analysis solutions) in the stakeholder project sessions (but this depends on whether the stakeholder is data driven by default or not).
Follow and respect the agreed processes when requesting time and work from other teams:
- Create a task in a ticket system, requesting the work.
- Book a meeting when hard to reach someone by mail or IM.
- Etc…

Development

Look to increase the visibility of the work done: share results widely, with monthly/weekly summaries.
Have a personal development plan: be aware of personal development needs, and create a plan to tackle them: new skills to develop, new tools to increase productivity, improving and mastering current processes. - Book time every few months to evaluate what should be developed next, what is not being done well, what could help more, etc…

References:

These learnings come mostly from on-the-job sources and my own learning; not all are my original ideas.

Data Analysis Mindset

2014-07-30T00:00:00+00:00

Analytics MindSet

Be constantly curious

Ask questions. Don’t be afraid to ask strange questions (if they offend, keep them to yourself, but question anyway). In data analysis it is good to ask about the apparently obvious things, and to question the established way and the assumptions. Assumptions, limitations and status change over time, so question whether existing assumptions still stand. Think like a child: why, why, why.

Frequently ask: I wonder what…, I wonder how… Have a real wish to understand things.

Think like a child

Think like a child: ask a lot and question obvious things. Adults overcomplicate matters; a child looks for simple explanations.

Think small: a big problem space has too much complexity, while a smaller one is easier to tackle and quicker to get results from.

Go for unexplored territory: in fields that have already been tried out it is less likely to find new things, unless there are new ways of looking at them.

Remove emotions

Try to understand by removing emotional biases. Seeing things detached, from the outside, is also required to be impartial.

Crime, deception, and ugly and wrong things in general bring loads of emotion; seek to understand why they happen without emotion. There is a reason. I would argue that, in general, any person put in the same situation (current incentive), with the same upbringing (past history) and the same physiology (a mental illness, for example), would do exactly the same.

Saying “I don’t know” is a good thing

It is important to say “I don’t know” (even if privately). People are biased against saying “I don’t know” because it can be seen as a sign of lack of knowledge, but only after recognizing that we don’t fully understand something can we actually start the process of finding out the truth.

Future predicting is tricky

Future guesses are overrated and less accurate than they seem.

We sometimes hear of a guess that correctly predicted the future, and praise it to no end. What we are missing are all the wrong guesses that we never hear about. Guessing wrong (in general) has no cost: people often forget a prediction when it is wrong, but praise it when it is right.

In essence: the incentives are wrong, and biased toward positive guesses.

The only way to know if predictions of the future are accurate is to keep track of both wrong and right guesses. Some studies mentioned in the book Think Like a Freak say they are about as good as random guesses.

On Learning

A key to learning is feedback: we need to know if we are on the right path, and to have a success measure.

Experiments (A/B tests) are a way of learning. Sometimes, in an experiment where people know they are part of an experiment, that knowledge might influence the outcome. Observational experiments are better, when we can infer something from a naturally occurring change: for example, a particularly cold winter would allow comparing behavior to a previous normal winter.

Look up the literature on how to properly set up experiments; there are specific techniques that help: double-blind experiments, A/A tests, etc…

Understand Incentives

One of the cornerstones of why things happen: people respond to incentives.

That’s how governments work: they change incentives so that people change their habits and ways of doing things.

More tax -> fewer purchases.

Less paperwork to create a company -> more companies will be created.

Lower the salaries of school teachers -> they will move to a better paying job when they can, potentially leaving less able teachers in the teaching system (?)

Find Root Causes

Look for root causes, not symptoms. Where is it coming from, where are the incentives, what’s the past history, what influenced it? Try to surface the real underlying root causes.

Things always have a reason to be, an explanation, which is often:

A consequence of:

Past history
Current incentives
- (sometimes) other variables

Variables might be unknown, but do trust that they exist (trust in science). Part of the work in data analysis is finding and verifying these variables.

Past history example

A reason for many armed conflicts in Africa is that it was divided artificially by colonizers, leaving different ethnic groups in the same “country”, which naturally ends up in war.

Structured data analysis

Have a structured way to approach data analysis problems. Be aware of a collection of tools and techniques that help tackle the problem.

Get time to think: block out thinking time.

Use a structured way of setting up the problem: properly define and re-define the problem, avoid the noise, and try to understand it well.

Tackling a problem

(Remember the hot dog eating competition from the book Think Like a Freak.) Practice it from the end user’s point of view, question current practice, look for improvements and try them out in a scientific way, measure, analyze, and ignore mental barriers (the currently known limits); mental barriers are a self-made obstacle.

Be precise

When working with numbers, it is important to be exact, have a sharp eye, and be detailed and organized. It is very easy to miss something in a sea of figures. You need to put on the OCD hat.

Look for the truth

This is a key part of doing data analysis. Keep it in mind and keep reminding yourself of it.

As the analysis gets more involved, as more emotions come into play, as people argue on different sides of it, as politics starts to influence sides, etc…, the truth is the fixed solid ground that the analyst can (and should) hold on to and safely pursue.

Set Behavior Traps

Teach the garden to weed itself

Sometimes it is possible to find creative ways to set traps that a person at fault is very likely to get caught in.

This taps into predictable human behaviors.

Examples: the brown M&M’s clause used to check whether people had fully read the conditions for Van Halen’s concerts; the book Think Like a Freak suggesting that terrorists get life insurance, to make them stand out for a strong positive identification.

Communication

When sharing a finding, or trying to get the resources required for an involved analysis, you need to know how to get the message across.

It is important to understand people, and to be able to communicate well.

When approaching someone:

Give downsides - there are always some; be honest.
Take into account the other person’s opinion; it might also reveal new ideas.
Avoid name calling; negative comments push people onto the defensive.
Most important: tell a story - that’s the best way to communicate and to teach a concept in a memorable way.

Quitting and Investment

Only after trying many flavors can we choose the best one.

Beware of the sunk cost fallacy. Quitting forces you to go to new places and often opens up new possibilities.

On Investment

Investment science is about buying into many properties and expecting some of them to succeed and some of them to fail (and need quitting).

Reference

Any of the Freakonomics books
Think Like a Freak for a how-to

Guesstimate

Imagine you are toying around with some idea and come up with a new question: what’s the size of that, what’s the total impact, could this be relevant, I wonder how many times this happens?

“I would imagine it to be very low” - for example. This is a guesstimate: even if somewhat vague, it is the quickest, intuition-driven way to size something up.

Guesstimates are useful as:

Quick upfront best guesses of sizes or impacts before looking at the data. - A time saver.
A way to calculate values we don’t have data for. - When data is too expensive to get; sometimes a key factor.

Often, people develop this skill naturally in specific areas when repeatedly exposed to them. For example, imagine you want to keep track of caloric intake: after a year of using an application to look up the calories of each food, you will naturally start memorizing and developing an intuition for how many calories are in each type of food.

Structured Guesstimate

Intuition about the problem and creativity play a big role, but with practice and structure this skill can be further developed into a great tool to have. It is also fun, and seems to be popular in modern job interviews as a way to evaluate creative and organized reasoning.

Techniques

Bound values: guess upper and lower limits, starting with extreme values, and keep closing the gap until it is no longer possible to be sure of a narrower range. Example: what’s the height of a 10-story building? Well, for sure each floor is more than 1m and less than 5m high, so the building is between 10m and 50m. Each floor has to accommodate people, so it is at least 2m high but likely no more than 4m, so a tighter guess is between 20m and 40m, etc…
Proxy: when it is not possible to get the data you are looking for directly, what is the closest data you have for it?
Be aware of common sizes that can be used as a reference. For example, if you are in the business of online marketing, be aware of the typical conversion rate of an email campaign.
Practice calculating numbers without a calculator; it will speed up your guesstimating ability.
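
The bound-values example can be written out as a tiny calculation. The `bounds` helper is a hypothetical name, and taking the midpoint of the final range as a single best guess is an added assumption, not part of the technique as described.

```python
def bounds(count, low_each, high_each):
    """Scale per-item lower and upper guesses to a total range."""
    return count * low_each, count * high_each


floors = 10
first_pass = bounds(floors, 1, 5)   # each floor is surely between 1m and 5m
second_pass = bounds(floors, 2, 4)  # tighter: floors fit people, so 2m to 4m

# Assumption: use the middle of the tightest range as a single best guess.
best_guess = sum(second_pass) / 2
print(first_pass, second_pass, best_guess)
```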

Reference

Data Analysis with Open Source Tools
http://www.analyticsvidhya.com/blog/2014/01/train-mind-analytical-thinking/

Storytelling and Presenting

2014-06-26T00:00:00+00:00

Effective Communications and Storytelling

aka: The Minto Pyramid

Good for “elevator pitch” also:

Start with the conclusion: “We should do this …”.
Breakdown the support points: “Because 1,2,3”.
Organize / structure thoughts as a Hierarchical structure (aka one sided mind map)
Breakdown each of the support points again if needed “1 is true because of 1.1, 1.2 and 1.3”, continue going down the hierarchy until no more questions appear.
Don’t use more than 5 supporting points (ideally 3 really).
In fact, before giving the conclusion it is better to give a bit of context, using “The Problem System” described below.

The hierarchical structure

Meta structures:
- Cause & Effect: top is effect because of cause 1,2,3.
- Divide the whole into parts: top is total that is composed by 1,2,3.
- Classify: top is the animal kingdom, which is then split into Mammals, Fish, Reptiles.
- Logical Order, like time: top is result for that to happen then we need 1,2,3.
- Questions, Why? How?: these are very common, they fit into previous classifications.
- Decision trees function in the same way.

Document using Hierarchical structure

Top level is the main title, each subsequent level is a section, then subsection, etc…

Reference:

http://www.barbaraminto.com/
http://www.slideshare.net/shachihp/minto-pyramid-presentation

Presenting

Communications rely on 3 questions

What do you want them to know?
What do you want them to do?
How do you want them to feel? (an emotional connection increases the chance of the message sticking)

BodyLanguage

Posture (Confident attitude):
- Sternum up (chest), even when sitting down.
Legs (stable solid position):
- Feet shoulder-width apart (see Obama, James Bond, etc…), even when sitting down.
- Don’t move around; moving is associated with fleeing, beware of that.
- No strange positions and movements, bobby, bouncer, Elvis knees (when sitting), etc…
Arms:
- BBC presenter hands.
- Palms down - certainty, for sure, absolutely, decisive, conclusive. End of a negotiation.
- Palms up - inviting, with possibility, typically an ask from audience.
Tension:
- Get a good balance of body tension: too much will become harsh (it affects the voice), too little does not look confident enough.
Eye contact:
- Constantly scan the whole room and make sure to make eye contact with everyone. Without constant eye contact people will start looking at their phones and laptops.
- Don’t look at the same person for too long.
Smile:
- Depending on the message, but almost always good in general.

Everyone will be a bit different and need to work on their weak spots.

Voice

Pitch
Avoid monotone voice, try keep some variation in voice.
Low pitch associated with seriousness.
High pitch associated with excitement.
Pace
slow = serious
fast = exciting
Pause
Add pauses especially for serious stuff, creates a break in the rhythm, gets attention.
Voice Projection (volume)
Strong in general is good
Passion
Introduces feeling, which can augment the message you are trying to pass on.
- Makes others trust that you fully believe the message you are transmitting.

Structured Presentations Systems

The Skeptical System

Works by answering typical skeptical questions from audience, how expert presenter is in the matter (establishing credibility), whats in it for me, etc…

Break the presentation in a sequence of 4 parts, answering the “skeptical” audience questions:

Why should i listen to you?
Credibility: i have many years of doing this … Part of a specialized team, etc..
Context, history
Whats in it for me ?
- You will get: …
How does it work ?
- Solution: By doing this …
- Number 3 facts (say 3 things, use fingers to enumerate, and last one is…)
- Case studies if needed (good to introduce emotion, personal stories, real events)
What now?
- 1 specific, easy action to take today, to move towards the goal
- Thanks 2x: Thanks, let me know if there are questions, thanks.

To prepare the presentation, divide an A4 sheet of paper into 4 areas and write the trigger words for each part in its area, to guide the presentation.

The Problem System

Another way of structuring a presentation, very often found in TV advertisements and even movies. It works by first creating the problem, and with it the need for a resolution, and then serving the solution.

Problem
I would like to make you aware of a problem…
Taps into our natural fear response.
Even better when the audience can relate to the problem as their own.
What if (promise of problem solving)
- What if I told you the problems identified before are all solvable?
- Problem addressing.
Solution
The solution is …
Because … Enumerate 3 things. (say 3 things, use fingers to enumerate, say “and last one is”)
Action
- Offer 2 options: the real one, plus a less attractive one that makes the real one look better. Ex: You can either buy 2 with a discount, or 1 without discount.
- Both options need to work in my favor.
- Thanks 2x: Thanks, let me know if there are questions, thanks.
Again, an A4 sheet divided in 4 with the trigger words is enough to guide the presentation.

During presentation

Anchoring:
- Associate common gestures or expressions with anchors, of bad vs good, that can be used later on.
- Energy balls. Left down is bad, right up is good. Presenter needs to mirror the audience, so presenter right = bad, left = good.
Timeline:
- Hands moving left to right show timeline. Presenter needs to mirror audience.
Big and small
- Open including all audience means big things. (like the gain from the solution)
- 2 pinched fingers (thumb and index finger), means small meaningless cost.

Powerpoints

The person is the central point of the presentation, not the powerpoint.
Don’t just read what is on the slide (otherwise you could just email the slides).
Use the sequence of slides as a storytelling structure.
Less is more. Delete everything non-essential.
3 main bullet points per slide max; break down into 3 more sub-points if needed.
Use images, covering the whole slide.
Eye candy matters.

Handling Objections

Empathize
I understand your problem
Avoid Buts: Yes… But… (don’t go into defense mode !). But=Nevertheless=”Don’t you think that”.
Clarify
Tell me more (palms up)
Use open ended questions, let the person speak.
Gives some thinking time.
It can be useful to defuse the objection by asking the rest of the people in the room whether they have the same problem.
Avoid saying too soon “let’s handle this after the meeting”; use it only as a last resort.
Propose
Best way is to offer a good solution.
So if we do this will it help ?
Offer a solution: if we guarantee this… if we send you an email with the options by end of week, etc…
Check
Check that the person has got their answer and is fine with proposed solution, don’t move on without it. Even if someone else opens up more objections.

Arguments, tools and patterns of reasoning

2014-06-25T00:00:00+00:00

This structures how we build arguments, common strategies used, and what kind of questions to tackle when building arguments.

Arguments

Observations (collected data) alone are not enough to act on. When we connect observations to how the world works, we have the opportunity to make knowledge. Arguments are what makes knowledge out of observations.

Only in mathematics is it possible to demonstrate something beyond all doubt.

How an argument works: An argument moves from statements that the audience already believes into statements they do not yet believe. It moves from prior belief to a new belief to establish knowledge in a defensible way.

There needs to be something that the audience is tentatively willing to agree to, or else there is no way forward.

Another source of prior or background knowledge is commonly known facts.

In reality not all wisdom can be fully verified, and people rarely require omnipotence in practice.

Data analysis is often an exercise in coming up with arguments that tackle a certain need, and then using data and statistical transformations of that data as evidence to support those arguments.

Claims

Arguments are built around claims. Before hearing an argument, there are some statements the audience would not endorse. After all the analyzing, mapping, modeling, graphing, and final presentation of the results, we think they should agree to these statements. These are the claims. Put another way, a claim is a statement that could be reasonably doubted but that we believe we can make a case for.

Evidence

A key part of any argument is evidence. Claims do not demonstrate themselves. Evidence is the introduction of facts into an argument.

Transformations make data intelligible, allowing raw data to be incorporated into an argument. A transformation puts an interpretation on data by highlighting things that we take to be essential.

Justification

We need some justification of why this evidence should compel the audience to believe our claim. We need a reason, some logical connection, to tie the evidence to the claim. The reason that connects the evidence to the claim is called the justification.

Finally, all justifications provide some degree of certainty in their conclusions, ranging from possible, to probable, to very likely, to definite. This is known as the degree of qualification of an argument.

Rebuttals

There are always reasons why a justification won’t hold in a particular case, even if it is sound in general. Those reasons are called the rebuttals. A rebuttal is the yes-but-what-if question that naturally arises in any but the most self-evident arguments.

Patterns of Reasoning

Categories of Arguments

When making an argument, try to find which category it falls into and which points of dispute need to be made clear.

A point of dispute is the part of an argument where the audience pushes back, the point where we actually need to make a case to win over the skeptical audience. All but the most trivial arguments make at least one point that an audience will be rightfully skeptical of.

A point of dispute will fall into one of four categories: fact, definition, value, and policy.

Stock issues tell us what we need to demonstrate in order to overcome the point of contention.

Fact

A dispute of fact turns on what is true, or on what has occurred. Such disagreements arise when there are concrete statements that the audience is not likely to believe without an argument. Some examples of disputes of fact: Did we have more returning customers this month than the last? Do children who use antibiotics get sick more frequently? The typical questions of science are disputes of fact.

Stock issues for disputes of fact:

What is a reasonable truth condition?
Is that truth condition satisfied?

Definition

Disputes of definition occur when there is a particular way we want to label something, and we expect that that label will be contested.

Stock issues with disputes of definition:

Does this definition make a meaningful distinction?
How well does this definition fit with prior ideas?
What, if any, are the reasonable alternatives, and why is this one better?

Value

When we are concerned with judging something, the dispute is one of value. For example, is a particular metric good for a business to use? We have to select our criteria of goodness, defend them, and check that they apply.

Which is more important, customer satisfaction or customer lifetime value? We often have to justify a judgment call.

For disputes of value, the stock issues are:

How do our goals determine which values are the most important for this argument?
Has the value been properly applied in this situation?

Policy

Disputes of policy occur whenever we want to answer the question, “Is this the right course of action?” or “Is this the right way of doing things?” Recognizing that a dispute is a dispute of policy can greatly simplify the process of using data to convince people of the necessity of making a change in an organization. Should we be reaching out to paying members more often by email? Should the Parks Department do more tree trimming?

Stock issues of disputes of policy are:

Is there a problem?
Where is credit or blame due?
Will the proposal solve it?
Will it be better on balance?

Aka: Ill, Blame, Cure, and Cost.

General Topics

Tools to build and reason about arguments.

Specific-to-general

A specific-to-general argument is one concerned with reasoning from examples in order to make a point about a larger pattern. The justification for such an argument is that specific examples are good examples of the whole. A particularly data-focused idea of a specific-to-general argument would be a statistical model. We are arguing from a small number of examples that a pattern will hold for a larger set of examples.

General-to-specific

General-to-specific arguments occur when we use beliefs about general patterns to infer results for particular examples. While it may not be true that a pattern holds for every case, it is at least plausible enough for us to draw the tentative conclusion that the pattern should hold for a particular example. For example, it is widely believed that companies experiencing rapid revenue growth have an easy time attracting investment. So if we demonstrate that a company is experiencing rapid revenue growth, it seems plausible to infer that the company will find it easy to raise money.

The archetypal rebuttal of a general-to-specific argument is that this particular example may not have the properties of the general pattern, and may be an outlier.

Argument By Analogy

Every mathematical model is an analogy. If we have two clients with a similar purchasing history to date, it seems reasonable to infer that after one client makes a big purchase, the other client may come soon after. The justification for argument by analogy is that if the things are alike in some ways, they will be alike in a new way under discussion.

The rebuttal for argument by analogy is the same as the rebuttal for general-to-specific arguments: what holds for one thing does not necessarily hold for the other. For example, physical objects experience second-order effects that are not accounted for in the simplified physical model taken from an engineering textbook.

Special Arguments

Patterns of argument building that pop up in settings where we use data professionally, though not only there.

Optimization: An argument about optimization is an argument that we have figured out the best way to do something, given certain constraints.
Bounding Case: Sometimes an argument is not about making a case for a specific number or model, but about determining what the highest or lowest reasonable values of something might be.
Cost/Benefit Analysis: In a cost/benefit analysis, each possible outcome from a decision or group of decisions is put in terms of a common unit, like time, money, or lives saved. The justification is that the right decision is the one that maximizes the benefit (or minimizes the cost).

Reference

Thinking with Data, Max Shron: http://vimeo.com/98768831. And http://blog.mortardata.com/post/91270402361/max-shron-thinking-with-data-talk

(Identifying) Weak Reasoning

A fallacy is an argument that uses poor reasoning. An argument can be fallacious whether or not its conclusion is true.

http://en.wikipedia.org/wiki/Fallacy

Cognitive biases are tendencies to think in certain ways. Cognitive biases can lead to systematic deviations from a standard of rationality or good judgment, and are often studied in psychology and behavioral economics.

http://en.wikipedia.org/wiki/List_of_cognitive_biases

https://bookofbadarguments.com/#

Data Analysis Process

2014-06-24T00:00:00+00:00

Structured Data Analysis: CoNVO

CoNVO is a structured way to tackle data analysis problems. It uses a few steps to decompose the problem, structure the analysis, sketch out a solution and envision how it is going to be acted upon. Essentially, it looks first at the big picture and then walks rationally towards a solution to the problem.

The CoNVO is written down before starting to implement a data analysis solution.

Context

Context is the defining frame that exists apart from the particular problem we are interested in solving.

Who are the people / organizations involved?
What do they do?
What is their purpose/goal?
Who are the decision makers?

Example:

This department in a large company handles marketing for a shoe manufacturer with a large online presence. The department’s end goal is to convince new customers to try its shoes and to convince existing customers to return again. The final decision maker is the VP of Marketing.

Need

What are the specific needs that could be fixed by intelligently using data? These needs should be presented in terms that are meaningful to the organization.

These are not well defined needs:

We need a weekly dashboard.
We need a daily report.

Well defined needs are instead:

Our customers leave our website too quickly, often after only reading one article. We don’t understand who they are, where they are from, or when they leave.
We want to decide between two competing vendors. Which is better for us?
Is this email campaign effective at raising revenue?

This step is about identifying the problem:

What is the problem / motivation?
The organization is not able to …
How can the organization do …

Hints:

Often the brief from the organization describing the need is somewhat biased and incomplete.
Requires a detailed understanding of the processes used.
What does intuition tell us?
- Talk to experts
- Do quick data exploration
This needs to be very clear, as it is the basis for the next steps.

Vision

The vision is a glimpse of what it will look like to meet the need with data. It consists of a mockup of the argument we’re going to make, that tackles the need.

Argument: Create an argument (a claim), that solves the need:
- By knowing … will help solve the problem because …
Vision: Define the Vision that solves it.
- example: A regular report with the KPIs
Mockup: Create a visual mockup of the solution (if it makes sense).

Hints:

Try to find the simplest possible solution that addresses the need.
Look at a catalog of existing solutions for similar problems. (Maybe keep and curate a catalog of known solutions.)
Create a visual mockup of what it will look like. Paper, PowerPoint and Excel are all good tools for this.
Often requires a few iterations of re-defining the vision:
Do quick data explorations to get better intuition.
Have a Kitchen Sink interrogation (can open up new ideas).
Do we have all the data we need?
Go into the details; limitations will often force an update of the vision.
Can be useful to work backwards from the solution to find what is needed.
Simulate end user interaction with the mockup solution, see if it works.

KPIs

Beware of the KPI rabbit hole: KPIs are incentives, and it is nearly impossible to fully predict all the effects of an incentive up front. Too narrow a focus, especially on a single KPI, can easily become counter-productive. => Use multiple KPIs, and don’t forget common sense.

Examples:

Price of digital media entertainment per hour - http://gigaom.com/2013/02/10/cost-per-hour-a-new-metric-for-paid-content/
Food price per kg - because package size varies, the price per kilogram is a clear way to compare prices across foods and package sizes.

Outcome

How the work will actually make it back to the rest of the organization, and what will happen once it is there. How will it be used?

How will the organization act on it?
Once the organization acts on it, how do we check that it has been successful? Plan a follow-up study.
Who will handle its maintenance?
Who needs to be trained on it? Both on how to interpret it and on how to maintain it.

Implementing CoNVO

After the written-down CoNVO, we go on to apply the sketched-out vision:

Define a timeline for each task; don’t let any particular one go on forever. Beware of rabbit holes. Do a Gantt chart, for example.
The implementation should be fully focused on the argument (and mockup) defined. It should be about finding evidence to support the argument’s claim.
- Go collect the data, summarize it and confirm it supports the claim.
- On each step, write down a summary of the findings. It is a way to clear up the result in your head and to lay out the next steps; realizations often come from this practice. It also avoids forgetting all the context the next time we need to look at it.
Include validation steps:
- Be sure to check the assumptions.
- After each step, validate that it is correct (like QE).
- Check frequently with stakeholders that it is meeting their needs.
- Pre-mortem exercise: in contrast to the post-mortem, where you try to figure out what went wrong after something went wrong, the pre-mortem is a preventive simulation. Start the thought process as: imagine the project went bad; what are the most likely causes, and what are the weakest links that should be strengthened?
Communicate the message in a polished way. Eye candy and selling/presentation skills matter for the final output. Note that this is a whole separate phase from the previous EDA (exploratory data analysis): the EDA phase has the purpose of showing which things to focus on, and most explorations end up nowhere and are thrown away; this phase focuses on the best way of communicating what the EDA found.

Reference

Thinking with Data, Max Shron: http://vimeo.com/98768831. And http://blog.mortardata.com/post/91270402361/max-shron-thinking-with-data-talk

Music Recording Cookbook

2014-06-03T00:00:00+00:00

Start with the Music Production post for a general intro.

Song structure

There are different approaches, but maybe the most common is to have sections of the song repeat while the lyrics typically change. This also helps hook the listener with a common background pattern. This is called strophic form (derived from a Greek word meaning “turn”); the alternative is through-composed form.

The parts that take “turns” are, for historical reasons, called Verse, Chorus, Bridge, Pre-chorus, etc. But I like the approach of naming sections generically: A, B, C, etc.

Commonly used song structures:

AABA: 32-bar form (8 bars each)
ABABB(B): verse-chorus style, B is the highlight
ABABCAB: C acts as a fairly different section (bridge) and then resolves back to the known (already heard) section.

But of course, songs don’t have to follow a pre-determined structure; you can do whatever you like.

Reference:

http://en.wikipedia.org/wiki/Strophic_form
http://en.wikipedia.org/wiki/Song_structure_(popular_music)

Setup for recording

Audio Settings

File recording: all uncompressed
48,000Hz, 24bit, 128 (less if possible) buffer size.

Recording Raw

Recording the unprocessed instrument is the most flexible option for later editing and mastering. All the processing then happens in software plug-ins on top of the already recorded instrument. See re-amping.

Recording Guitars

Widening

Record the same part 2 times into different tracks and pan one to the right and the other to the left (80-95%). The very small differences in playing make them sound big. But it only sounds good if the two takes are really close.
Record them with different sounds, i.e. a different amplifier/cabinet combination.

This produces a wider sound, less central, with more stereo spread, and a bigger sound in general. Some people even use 4 recordings for an even bigger effect.

(Fake) stereo with a delay

Delay one channel (either left or right) by about 20-30 ms; this produces a wide stereo effect, similar to what happens in reality when a sound reflecting off a wall reaches the other ear a few milliseconds later.
Use a stereo widening plug-in that adds a delay to either the right or left channel to simulate space and give a wider stereo sound. (http://www.kvraudio.com/product/adt_by_vacuumsound)
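
As a rough illustration of the delay trick, here is a minimal Python sketch that turns a mono list of samples into a left/right pair by delaying one channel. The function name `fake_stereo`, the 25 ms default and the plain-list sample format are assumptions made for this example; a real plug-in does much more than this.

```python
def fake_stereo(mono_samples, delay_ms=25, sample_rate=48000):
    """Return (left, right) where the right channel is delayed by delay_ms."""
    # Convert the delay from milliseconds to a whole number of samples.
    delay = round(delay_ms * sample_rate / 1000)
    # Pad with silence so both channels keep the same length.
    left = list(mono_samples) + [0.0] * delay
    right = [0.0] * delay + list(mono_samples)
    return left, right


left, right = fake_stereo([1.0, 0.5], delay_ms=25, sample_rate=48000)
print(len(left), len(right))
```

At 48,000 Hz a 25 ms delay is 1200 samples, so the right channel starts with 1200 samples of silence.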

Mixing

Guitars especially with distortion tend to fill in a wide spectrum, so the idea in EQ’ing guitars is all about subtracting:

Don’t pile up instruments in the same range. Instead assign each instrument its own frequency range (typically an octave per instrument).
Apply high-pass filters to every single instrument. Shave off all the unnecessary bass frequencies below each sound.
Sometimes overlapping instruments is inevitable (e.g. bassline and kick drum). To separate overlapping instruments, use ‘inverted EQ’: apply a boost and a cut to the first instrument. Then copy that same EQ to the second instrument, but invert it: where you boosted by 2 dB, now cut by 2 dB, and where you cut by 2 dB, now boost by 2 dB.
Since all instruments overlap in the midrange, you can go through every instrument and cut slightly between the 200-500Hz range. This is a small effect on each instrument, but really adds up and results in a much cleaner mix.

Alternatively this can be done with a Multi-band compressor, for a more adaptive ‘fix’.

Reference:

HOW TO CREATE A CLEAN MIX / ARRANGEMENT: https://www.youtube.com/watch?v=G8lFAaANnhs

Soloing

It is easy for the solo to get muddy when added on top of existing tracks of the same instrument (or the same frequency range).

Automate in the DAW a lowering of the volume of the already recorded instruments in the same frequency range during the solo; this helps the solo stand out. For example, lower the volume of the rhythm guitar parts during the guitar solo. This is fairly easy to do in a DAW. Leave the rest of the instruments at the same level.
Punch holes in the frequency range of the song to open up space for the soloing instrument: find the frequencies the soloing instrument uses and then compress those in the rest of the mix. This is the inverted EQ trick mentioned in the “Mixing” section.

Song Sections Transitions

Techniques to accentuate the move from one section to another, or to make the transition more natural.

Brief Mute

Right before a heavy-hitting section, mute completely for a few milliseconds (at the end of the previous section); it accentuates the hit of the heavy section.

Volume roll in

To transition from a quiet section to a loud, dense one, a good trick is to start the dense section at low volume (at the same level as the quiet part) and raise it progressively to full. Go from start to end fairly quickly, 1 sec max?

Stutter Effect

Gaps in audio in fast rhythmic intervals

Tape slow down and speed up

Simulating a slowdown, like an old cassette player losing its batteries?

Mixing

“Cleaner” Reverb

Use a delay of 80-90 ms, e.g. left: 80 ms, right: 90 ms, with left/right crossover and wet at 30%; add as many (or as few) repeats as needed.

Mastering

When mastering, use a commercial audio track as a reference for the ideal sound, and A/B between the 2.

EQ Clean: remove ugly frequencies:
- Remove rumble below 50 Hz
- Look for specific resonances to remove
Split: Center + Sides
- Remove highs from the center using a low pass filter, avoid a floppy low end.
- Remove lows from the sides, using a high pass filter, widen the sides a little by increasing volume.
Compress, glue things together (Vladg/sound Molot)
- Even better with a Dynamic, multi-band compressor, to work on different regions
EQ, give a little bump to 4 kHz and 10 kHz for that extra sizzle (lkjb Luftikus, TDR VOS Slick EQ)
- EQ like hifi, little color stronger low end, stronger high end (TDR VOS Slick EQ)
Color: Add Subtle warm saturation. Push sides highs up
Make it loud: Maximizer and limiter at the end (Vladg/sound Limiter No6)

Reference:

https://www.youtube.com/watch?v=Bah367_iLBg
https://www.youtube.com/watch?v=FzNweEPg-2U

Final output

I want to convert the final audio or video into different formats, for different purposes.

Convert from wav to mp3

Export raw from the DAW.
Use Audacity with the LAME encoder pack.

Convert videos in raw form to mp4

Export raw from the DAW.
Convert to mp4 using HandBrake.

Reaper video export reference settings

Video (FFmpeg encoder) MKV container
HUFFYUV video codec, 24 bit PCM audio
Set: “Get native video settings”

Sound Quality Characteristics

Overdrive, Distortion, Fuzz

Rat: drive up to half; more than that starts to lose definition. To get more gain, use a cleaner (low drive) Tube Screamer to push it.
OCD should be used more as an overdrive; too much drive starts to lose definition. Works great with a BB Preamp pushing it.

Data Analysis Techniques

2014-04-08T00:00:00+00:00

Techniques to transform and interpret data in a useful way.

Quantify Impact

I want to understand how big a change is.

(i.e. calculate the relative change)

How

Let’s say a bank raised a rate from 3% to 5%. The difference is 2 percentage points, but saying “the rate was increased by 2%” is ambiguous: it makes the increase look very small, when it is actually +2 on top of an original 3.

Absolute change: 5% - 3% = 2 percentage points.
Relative change: (5 - 3) / abs(3) = 0.667, so the rate was increased by about 67%.

Relative calculation general formula: (new - reference) / abs(reference)

The abs() in the denominator matters when the reference is negative: it keeps the sign of the result meaningful, positive when the value went up and negative when it went down.
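
The general formula can be sketched in Python as follows. The function name `relative_change` and the choice to raise an error for a zero reference are assumptions made for this example, not part of the original recipe.

```python
def relative_change(new, reference):
    """Relative change from reference to new: (new - reference) / abs(reference)."""
    if reference == 0:
        # The formula divides by the reference, so it is undefined at 0.
        raise ValueError("relative change is undefined for a reference of 0")
    return (new - reference) / abs(reference)


print(relative_change(5, 3))    # bank rate example: 3% -> 5%
print(relative_change(-2, -4))  # negative reference, value went up
```

With a negative reference, going from -4 to -2 comes out positive, which is why the denominator uses abs().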

I want to understand how much bigger one number is than another

You are looking at a table of numbers, comparing 2 of them and wondering by how much one is bigger than the other.

Starting with: 22 out of 4236
Approximate to rounder numbers: 20 out of 4000 (some precision is lost, but depending on the application that is often ok).
Take out the same number of 0’s from each (here one zero, i.e. dividing both by 10): 2 out of 400.
Divide both to get the smaller number to 1; try dividing by the prime numbers under 10: 2, 3, 5 or 7 (in this case 2): 1 out of 200.
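
The same mental steps can be sketched in Python: round each number to one significant digit, then let `fractions.Fraction` do the reduction. The helper names are made up for this example, and it assumes positive, non-zero inputs; the reduced fraction is not always of the form 1 out of N (30 out of 4000 reduces to 3 out of 400).

```python
from fractions import Fraction
from math import floor, log10


def round_to_one_significant_digit(value):
    """22 -> 20, 4236 -> 4000 (assumes value > 0)."""
    magnitude = 10 ** floor(log10(value))
    return int(round(value / magnitude) * magnitude)


def approximate_ratio(part, whole):
    """Round both numbers, then reduce the fraction to its simplest form."""
    return Fraction(round_to_one_significant_digit(part),
                    round_to_one_significant_digit(whole))


ratio = approximate_ratio(22, 4236)
print(f"{ratio.numerator} out of {ratio.denominator}")
```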

I want to quickly size up something to see how important it might be

The art of guesstimating can be very useful to quickly run numbers on what is going to happen, the likely impact, etc.

Data Quality: confidence interval

I want to

Give some certainty about the average I am reporting, by adding the range in which this number is expected to fall most of the time.

How

When estimating a parameter like a mean from random samples, a confidence interval helps quantify the uncertainty, giving an interval in which we can expect the value to be most of the time.

Prepared Excel: confidence_interval.xlsm

Python:

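def confidence_interval(s, confidence=0.95):
    """Normal-approximation confidence interval for the mean of the sample s."""
    import math
    import numpy as np
    from scipy import stats

    n, min_max, mean, var, skew, kurt = stats.describe(np.array(s))
    std = math.sqrt(var)  # sample standard deviation (describe uses ddof=1)

    # The location (loc) keyword specifies the mean.
    # The scale (scale) keyword specifies the standard error of the mean.
    # We will assume a normal distribution.
    # The first argument is the confidence level itself: 0.95 gives a 95%
    # interval (passing 0.975 here would give a 97.5% interval).
    return stats.norm.interval(confidence, loc=mean, scale=std / math.sqrt(n))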
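
As a cross-check that does not need SciPy, the same normal-approximation interval can be sketched with the standard library only. The function name and the `confidence` parameter are illustrative assumptions. Note that a 95% interval uses the 0.975 quantile of the standard normal (about 1.96) as its multiplier, which is easy to mix up with the confidence level itself.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def normal_confidence_interval(sample, confidence=0.95):
    """Confidence interval for the mean, assuming a normal distribution."""
    # Two-sided interval: a 95% level leaves 2.5% in each tail,
    # so the multiplier is the 0.975 quantile (about 1.96).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    center = mean(sample)
    margin = z * stdev(sample) / sqrt(len(sample))  # z * standard error
    return center - margin, center + margin


low, high = normal_confidence_interval([10, 12, 14, 16, 18])
print(round(low, 2), round(high, 2))
```

For small samples a t-distribution based interval would be wider; this sketch keeps the normal assumption used above.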

Data Quality: hypothesis testing

I want to make sure the results I got are statistically significant (and not due to chance).

Test of Proportion

For comparing percentage metrics like Conversion Rate, Click-through Rate, etc., use a proportion test.

Finding Sample Size

To find what sample size is needed for a test, use:

In R stats:

To detect a change in rate from 1.5% to 2.5% (a 1 percentage point increase), at a 5% significance level and 80% power:

> power.prop.test(p1=0.015, p2=0.025, sig.level=0.05, power=0.8)
 
 Two-sample comparison of proportions power calculation 
 
          n = 3075.582
         p1 = 0.015
         p2 = 0.025
  sig.level = 0.05
      power = 0.8
alternative = two.sided

We need at least 3076 samples in each group.

In Python:

To detect a change in rate from 1.5% to 2.5% (a 1 percentage point increase), at a 5% significance level and 80% power:

> import statsmodels.stats.api as sms
> es = sms.proportion_effectsize(0.015, 0.025)
> sms.NormalIndPower().solve_power(es, power=0.8, alpha=0.05, ratio=1)

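We need at least 3029 samples in each group. (This is not exactly the same as in R because the two use slightly different approximations: statsmodels works on the arcsine-transformed effect size returned by proportion_effectsize, while power.prop.test works on the proportions directly.)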
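
The gap between 3076 and 3029 can be reproduced with the standard library. This is a sketch under two assumptions: that R's power.prop.test uses the normal approximation on the raw proportions, and that the statsmodels call uses Cohen's h, i.e. 2 * (asin(sqrt(p1)) - asin(sqrt(p2))), as the effect size. The function names are made up for this example.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist


def _z(p):
    """Quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(p)


def n_direct(p1, p2, alpha=0.05, power=0.8):
    """Per-group sample size using the variances of the raw proportions."""
    p_bar = (p1 + p2) / 2
    numerator = (_z(1 - alpha / 2) * sqrt(2 * p_bar * (1 - p_bar))
                 + _z(power) * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p1 - p2) ** 2


def n_arcsine(p1, p2, alpha=0.05, power=0.8):
    """Per-group sample size using the arcsine-transformed effect size h."""
    h = 2 * (asin(sqrt(p1)) - asin(sqrt(p2)))
    return 2 * ((_z(1 - alpha / 2) + _z(power)) / h) ** 2


print(ceil(n_direct(0.015, 0.025)), ceil(n_arcsine(0.015, 0.025)))
```

Both are approximations of the same quantity, so either number is a reasonable planning figure; rounding up to the larger one is the conservative choice.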

Test if statistically significant

Chi-Squared Test

In R:

For example, to test two campaigns, each with 1000 displays, with 32 and 54 conversions respectively:

> prop.test(c(32, 54), c(1000,1000))
 
    2-sample test for equality of proportions with continuity correction
 
data:  c(32, 54) out of c(1000, 1000)
X-squared = 5.3583, df = 1, p-value = 0.02062
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.040754721 -0.003245279
sample estimates:
prop 1 prop 2 
 0.032  0.054

The p-value is less than 0.05, so we can reject the hypothesis that conversion rates are equal and assume the second group has a higher rate.
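
For reference, the same test statistic can be sketched in plain Python: a chi-squared test on the 2x2 table of conversions and non-conversions, with the Yates continuity correction. The function name is an assumption for this example; the p-value uses the fact that, for 1 degree of freedom, the chi-squared tail probability equals erfc(sqrt(x / 2)).

```python
from math import erfc, sqrt


def two_proportion_chi2(successes_a, total_a, successes_b, total_b):
    """Chi-squared test (1 df, Yates correction) for two proportions."""
    failures_a = total_a - successes_a
    failures_b = total_b - successes_b
    n = total_a + total_b
    # |ad - bc| from the 2x2 table, reduced by n/2 (continuity correction).
    cross = abs(successes_a * failures_b - failures_a * successes_b)
    corrected = max(cross - n / 2, 0.0)
    chi2 = n * corrected ** 2 / (
        total_a * total_b
        * (successes_a + successes_b) * (failures_a + failures_b))
    p_value = erfc(sqrt(chi2 / 2))  # chi-squared survival function for 1 df
    return chi2, p_value


chi2, p_value = two_proportion_chi2(32, 1000, 54, 1000)
print(round(chi2, 4), round(p_value, 5))
```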

Prepared Excel: hypothesis_testing_proportion.xlsm

Prepared Excel: hypothesis testing proportion_Chi-squared.xlsm

Reference:

Comparing Population Proportions 1 - https://www.youtube.com/watch?v=a1Ye5RcWOqg
Comparing Population Proportions 2 - https://www.youtube.com/watch?v=MNbat1lrJW4
Hypothesis Test Comparing Population Proportions (part 3) - https://www.youtube.com/watch?v=dvSa_tx04hw

Test of Means

Comparing non-fractional values that follow a normal distribution (e.g. Average Order Value, Time Spent on Page etc.) is done with a Two-sample unpaired t-test.

Welch Two Sample t-test Two-Sample T-Test

Find Sample size

In R:

The recommended values for d (the effect size, as used by the pwr package) are: 0.2 for small effects, 0.5 for medium and 0.8 for big effects. Sample call for computing the test group size for a campaign with an estimated medium effect and 10000 customers in the control group:

> pwr.t2n.test(n1=10000, d=0.5, sig.level=0.05, power=0.90)
 
 t test power calculation 
 
         n1 = 10000
         n2 = 42.21519
          d = 0.5
  sig.level = 0.05
      power = 0.9
alternative = two.sided

Only around 43 customers are needed in the test group.

Test for significance

In R:

> t.test(group1, group2)
 
    Welch Two Sample t-test
 
data:  group1 and group2
t = -1.5631, df = 99.423, p-value = 0.1212
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.0219603  0.8334419
sample estimates:
mean of x mean of y 
 125.0789  128.1731

Our p-value is greater than 5%, so we cannot reject the hypothesis that the means are equal; the observed difference may be due to chance. More tests or a larger sample size may be necessary.
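
To see where the t and df values in that output come from, here is a minimal standard-library sketch of the Welch statistic and the Welch-Satterthwaite degrees of freedom. The function name and the sample data are made up for this example. Turning t and df into a p-value needs the t distribution, for example scipy.stats.ttest_ind(group1, group2, equal_var=False) in SciPy.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group1, group2):
    """Return (t, df) for Welch's two-sample t-test (unequal variances)."""
    n1, n2 = len(group1), len(group2)
    # Squared standard error of each group mean.
    v1, v2 = variance(group1) / n1, variance(group2) / n2
    t = (mean(group1) - mean(group2)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation, usually not a whole number.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t, 4), round(df, 3))
```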

Reference:

http://www.marketingdistillery.com/2014/08/03/ab-tests-in-marketing-sample-size-and-significance-using-r/
http://www.evanmiller.org/ab-testing/t-test.html
http://stattrek.com/hypothesis-test/hypothesis-testing.aspx

Data Quality: Get an idea on the dispersion of the data

I want to

Get an idea on the dispersion of the data, what values are more common, what values are outliers.

See chart type Histogram and Boxplot from: http://al3xandr3.github.io/data-visualization-techniques.html

Simulations

This is a great way to test out models, hypotheses and behavior: generate random (and, on purpose, biased) data and use it to test and evaluate formulas, algorithms and theories. This is something computers are great at, and that was not possible a few years ago.

It also allows for optimization: plot the progression of the simulation over time, check its convergence, and see whether the algorithm can be made to converge faster.

Specifically, can be useful for:

Generate random data
Try to guess what should it look like.
Predict the future (based on the historical observations)
Simulate theories, instead of complex math: hypothesis testing alternative, resampling

Sometimes a simulation on paper is enough, but paper is costly (especially in time) when running many simulations, for example to get the distribution of the solutions. For more than a few executions, use a computer (Excel, Python, Ruby, R, etc.).

Example:

Given the history of values, what is the expected value in a week?
Given the history of values, in how many days is it expected to reach a certain value?
Trying to figure out whether the data I’m looking at is generated by sampling on visitors or by sampling on visits: do a simulation on paper of the expected volumes for each, and see which one is closer.
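
The first example can be sketched as a small resampling simulation in Python. This is only one possible approach, assuming that future day-to-day changes look like the past ones; the function name and the sample history are made up for this example.

```python
import random


def simulate_future_values(history, days_ahead, runs=1000, seed=0):
    """Resample past day-to-day changes to simulate the value days_ahead later."""
    rng = random.Random(seed)  # fixed seed so the runs are repeatable
    changes = [b - a for a, b in zip(history, history[1:])]
    outcomes = []
    for _ in range(runs):
        value = history[-1]
        for _ in range(days_ahead):
            value += rng.choice(changes)  # draw one historical change per day
        outcomes.append(value)
    return outcomes


history = [100, 103, 101, 106, 108, 107, 112]
outcomes = sorted(simulate_future_values(history, days_ahead=7))
expected = sum(outcomes) / len(outcomes)
low = outcomes[int(0.05 * len(outcomes))]
high = outcomes[int(0.95 * len(outcomes)) - 1]
print(expected, low, high)
```

The sorted outcomes give the whole distribution, not just the expected value, so a range can be reported next to the average.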

Python Code: simulate.py

http://en.wikipedia.org/wiki/Resampling_(statistics)
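As a rough illustration of the first example question (this is not the simulate.py mentioned above, just a sketch), one option is to resample the past day-to-day changes and replay them many times. The history values, the function name and the assumption that future changes look like past ones are all mine.

```python
import random
import statistics

def simulate_week_ahead(history, days=7, runs=10_000, seed=42):
    """Resample past day-to-day changes to simulate the value `days` ahead."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    changes = [b - a for a, b in zip(history, history[1:])]
    outcomes = []
    for _ in range(runs):
        value = history[-1]
        for _ in range(days):
            value += rng.choice(changes)  # draw one historical change at random
        outcomes.append(value)
    return outcomes

# Hypothetical daily values, made up for illustration.
history = [100, 102, 101, 105, 107, 106, 110, 113]
outcomes = sorted(simulate_week_ahead(history))
expected = statistics.mean(outcomes)
low = outcomes[int(0.05 * len(outcomes))]
high = outcomes[int(0.95 * len(outcomes))]
print(f"expected in a week: {expected:.1f} (90% of runs between {low} and {high})")
```

Besides the expected value, the sorted outcomes give the whole distribution, which is the part that is painful to get on paper.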

Analysis: Reveal impacting features in a specific goal.

Let's say I am looking at a website's new-user registration flow and want to find areas to improve. For example, I might find that users on mobile devices in Portugal are having a hard time registering, which might in turn lead to the finding that a page in that flow is incorrectly translated.

For that we need to break the data down by the available dimensions and look at them in combination, to find strong signals.

This is also useful to pin down performance problems or even outliers.

How to

Decide on the main goal to use, for example the registration rate (completed registrations / started registrations)
Create (probably multiple) breakdowns by all the variables available to you (Country, Operating System, Device, Browser, Entry Page, Channel, Referrer Domain, etc.)

When doing breakdowns:

Keep in mind the context of the data: which dimensions make sense to look into and which do not. Start there.
Discard dimensions: check whether there is enough volume in a dimension for it to be worth pursuing.
Discard dimension values below a significant volume (keep only the top 100 or 1k rows of a dimension, etc.): it is not worth investing in a fix for Antarctica, where only 1 person is getting stuck in the registration. For example, add a column in Excel named Worth looking at? with =IF(AND(G41>100, H41< 40%), TRUE, FALSE): if there is enough volume and performance is bad, label it TRUE.
Play around with the breakdowns: try combining them in different orders and look for generalizations, for example mobile vs non-mobile users instead of the full list of operating systems (if they are significant enough). Look for the simplest breakdown that still gives a strong signal. End up with a couple of high-level dimensions containing only a few values (mobile vs non-mobile and natural vs paid search, for example).
Choose dimensions, and the way to analyze them, based on how they can be acted upon. For example, Android users failing to register means a fix in the Android-dedicated web flow. For a badly performing referrer domain, it may make more sense to pick only the very successful ones and try to incentivize those more, since not much can be fixed for the bad ones. This varies a lot with the situation.
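The breakdown steps above can be sketched in a few lines of Python. This is only an illustration: the rows, the column names and the thresholds (more than 100 started registrations, rate under 40%, mirroring the Excel formula) are assumptions, not real data.

```python
from collections import defaultdict

def breakdown(rows, dimensions, min_volume=100, max_rate=0.40):
    """Registration rate per combination of `dimensions`, flagging weak segments."""
    started = defaultdict(int)
    completed = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        started[key] += row["started"]
        completed[key] += row["completed"]
    report = []
    for key, volume in started.items():
        rate = completed[key] / volume if volume else 0.0
        # Same rule as the Excel column: enough volume and poor performance.
        worth_looking_at = volume > min_volume and rate < max_rate
        report.append((key, volume, rate, worth_looking_at))
    # Largest segments first, so low-volume noise sinks to the bottom.
    return sorted(report, key=lambda r: r[1], reverse=True)

# Hypothetical registration data, invented for illustration only.
rows = [
    {"country": "PT", "device": "mobile", "started": 400, "completed": 80},
    {"country": "PT", "device": "desktop", "started": 300, "completed": 180},
    {"country": "US", "device": "mobile", "started": 900, "completed": 500},
    {"country": "AQ", "device": "desktop", "started": 1, "completed": 0},
]
report = breakdown(rows, ["country", "device"])
for key, volume, rate, flag in report:
    print(key, volume, f"{rate:.0%}", flag)
```

Calling breakdown(rows, ["device"]) instead gives the coarser mobile vs non-mobile view, which is an easy way to try the different combinations.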

Classification

Create a classification, so that we enrich the data set with new useful features to analyze the rest of the data with. Also to use as modeling parameters.

This could include classifying for example: user behaviors, user value, etc…

Clustering and Similarity matching

Find what characteristics users that churned have (a churn model), then look in your user base for cohorts with similar characteristics, which are potentially at risk of churning, and create hypotheses on activities to help reduce churn.
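One simple way to do the similarity matching is to compare each active user with the average churned user. The sketch below uses cosine similarity for that. The features, the user ids and the 0.95 threshold are hypothetical, and in practice the features would need to be put on comparable scales first.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def at_risk_users(churned, active, threshold=0.95):
    """Active users whose features resemble the average churned user."""
    n = len(churned)
    centroid = [sum(col) / n for col in zip(*churned)]
    return [uid for uid, feats in active.items()
            if cosine_similarity(feats, centroid) >= threshold]

# Hypothetical features: [logins per week, support tickets, days since last visit]
churned = [[1, 3, 20], [0, 4, 25], [2, 2, 18]]
active = {"ann": [9, 0, 1], "bob": [1, 3, 22], "cy": [7, 1, 2]}
risky = at_risk_users(churned, active)
print(risky)
```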

Regression

Regression as an estimation / prediction technique

What results will we get in 3 months if this trend continues? When will we reach result XYZ?
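Both questions can be answered by fitting a straight line to the history and extending it. Here is a minimal sketch, assuming a linear trend and made-up monthly numbers. The function name and the target value are only for illustration.

```python
def fit_line(ys):
    """Least-squares line y = intercept + slope * t for t = 0, 1, 2, ..."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return y_mean - slope * t_mean, slope

# Hypothetical monthly results, made up for illustration.
monthly = [100, 110, 120, 130, 140, 150]
intercept, slope = fit_line(monthly)

# What do we get 3 months after the last observation if the trend continues?
in_three_months = intercept + slope * (len(monthly) - 1 + 3)

# When do we reach a target result (in months from the first observation)?
target = 200
months_to_target = (target - intercept) / slope
print(in_three_months, months_to_target)
```

With real data the trend is rarely this clean, so the extrapolation is only as good as the assumption that the trend continues.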

Regression as a causal modeling technique (aka correlation finder)

Regression can be used in explaining how one or many (using multivariate regression) factors affect an outcome of interest.

For example, what contributes more to a movie's final revenue: the size of its budget or the viewers' rating? Likely both influence it. A movie with a very low budget generally can't compete with a huge-budget movie, but a big-budget movie can also end up not being liked by viewers. So which of those 2 variables contributes the most to the final movie revenue?

Here we can build a regression model that, given a set of historical data, estimates how much each of these contributes to the revenue.
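To make this concrete, here is a small sketch using NumPy's least-squares solver. The movies are invented, and the revenue is built from a known formula so that the fitted coefficients can be checked against it. Standardizing the factors is one common way (an assumption here, not the only option) to compare contributions measured in different units.

```python
import numpy as np

# Hypothetical movies, invented for illustration: budget (in $M) and viewer rating.
budget = np.array([10.0, 50.0, 100.0, 150.0, 200.0, 30.0])
rating = np.array([8.0, 6.5, 7.0, 5.0, 7.5, 9.0])
# Revenue is constructed from a known formula so the fit can be checked.
revenue = 10.0 + 2.0 * budget + 5.0 * rating

def standardize(x):
    """Rescale to mean 0 and standard deviation 1 so coefficients are comparable."""
    return (x - x.mean()) / x.std()

# Design matrix: a column of ones (intercept) plus the two factors.
X = np.column_stack([np.ones_like(budget), budget, rating])
coef, *_ = np.linalg.lstsq(X, revenue, rcond=None)

# Same fit on standardized factors: effect of a one-standard-deviation change.
Xs = np.column_stack([np.ones_like(budget), standardize(budget), standardize(rating)])
std_coef, *_ = np.linalg.lstsq(Xs, revenue, rcond=None)
print("per-unit effects:", coef[1:], "standardized effects:", std_coef[1:])
```

In this made-up data set the budget has the larger standardized effect, simply because budgets vary far more than ratings do.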

Data Cleaning: Fill in missing values

Python and Pandas:

You can use the reindex() method of DataFrame:

import pandas as pd

# One value per day for a week
x = pd.date_range('2013-01-01','2013-01-07',freq='D')
y = list(range(7))
df=pd.DataFrame(index=x,data=y,columns=['value'])

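To add the missing timestamps (missing days such as holidays or, as in this example, a finer 4-hour grid), reindex it against the full range you want. The new rows get NaN values: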

x2= pd.date_range('2013-01-01','2013-01-07',freq='4H')
df2=df.reindex(x2)

Then you may fill the gaps in values using interpolate() method of Series (different interpolation methods are available):

df2.value=df2.value.interpolate(method='linear')

Reference: http://stackoverflow.com/questions/20240749/pandas-dataframe-interpolating-missing-days

Optimization

Optimization Process

Baseline - where are you now?

What KPIs are important to look at given the goal? Learn the domain and ask the big-scope questions: What is the purpose of this? Why does it exist? What do I need to look at to know the end goal is improving?
Do I have this data? Do I need to add more instrumentation?
Set up reports / dashboards

Formulate hypothesis on what to change to get to the goal

Play with the data to come up with theories and hypotheses to test: try correlations, plot and visualize from different angles, and use external data too if possible

Test (A/B test)

Make a change and see its impact on the KPIs.
Turn the results into a story and communicate it, in slides if needed

Iterate

Go back to Baseline and start again. Are we using the right KPIs? (often the first ones are not great)
More hypotheses, more tests (once in a while you might find something worthwhile…)

AB testing

Because correlation does not (always) imply causation, A/B testing is the way to test a hypothesis derived from a correlation observed in the data (the output of a previous data science method).

This is also, for example, the approach the medical community uses to test whether a new medication is effective, and it is a common, solid scientific method.

Hypothesize how to make something better, then run an experiment to validate it. Support product changes by A/B testing them and measuring the consequences.

The setup of an experiment is to compare 2 groups where the variable we want to test is different for each group, while keeping all the other variables (possible confounding factors) the exact same. (see also https://en.wikipedia.org/wiki/Ceteris_paribus)

We essentially need 4 data points, these 2 metrics for each group:

Test group size (at the start of experiment)
Success group size (at the end of experiment)

We need to ensure the experiment runs long enough that we can be confident in the results (statistical significance). Apply a proportion test, for example (https://www.khanacademy.org/math/probability/statistics-inferential/hypothesis-testing/v/large-sample-proportion-hypothesis-testing).

references: sample sizes http://camdp.com/blogs/number-samples-needed-b-test
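With the 4 data points in hand, a proportion test fits in a few lines. The sketch below is a standard two-sided, pooled two-proportion z-test, assuming samples large enough for the normal approximation. The counts are invented for illustration.

```python
import math

def two_proportion_z_test(size_a, success_a, size_b, success_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = success_a / size_a, success_b / size_b
    pooled = (success_a + success_b) / (size_a + size_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / size_a + 1 / size_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, invented for illustration.
z, p_value = two_proportion_z_test(size_a=1000, success_a=100,
                                   size_b=1000, success_b=130)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A p-value below the chosen threshold (5% is the usual convention) suggests the difference between the groups is unlikely to be chance alone.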

Intelligent Agents Model

Amazingly, Optimization, AB & Multivariate Testing, Behavioral Targeting, Attribution, Predictive Analytics, … can all be recast as components of a simple, yet powerful framework borrowed from the field of Artificial Intelligence, the intelligent agent.

Artificial intelligence Optimization Process

The goals are what the agent wants to achieve, what it is striving to do.

When the agent achieves a goal, it gets a reward based on the value of the goal.

Given that the agent has a set of goals and allowable actions, the agent's task is to learn what actions to take given its observations of the environment: what it 'sees', 'hears', 'feels', etc. Assuming the agent is trying to maximize the total value of its goals over time, it needs to select the action that maximizes this value, based on its observations.

So how does the agent determine how to act based on what it observes? The agent accomplishes this by taking the following basic steps:

Observe the environment to determine its current situation. You can think of this as data collection.
Refer to its internal model of the environment to select an action from the collection of allowable actions.
Take an action.
Observe the environment again to determine its new situation. So, another round of data collection.
Evaluate the 'goodness' of its new situation: did it reach a goal and, if not, does it seem closer to or further away from reaching a goal than before it took the last action?
Update its internal model on how taking that action 'moved' it in the environment and whether it helped it reach or get closer to a goal. This is the learning step.

By repeating this process, the agent's internal model of how the environment responds to each action continuously improves and better approximates each action's actual impact.
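The loop above can be sketched as a tiny epsilon-greedy agent choosing between two page variants. This is one simple strategy picked for illustration, not necessarily the one the reference describes. The conversion rates, the 10% exploration rate and the function name are assumptions.

```python
import random

def run_agent(true_rates, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent choosing between page variants (the actions)."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    pulls = [0] * len(true_rates)      # how often each action was taken
    values = [0.0] * len(true_rates)   # internal model: estimated reward per action
    for _ in range(steps):
        # Select an action: mostly the best-looking one, sometimes a random one.
        if rng.random() < epsilon:
            action = rng.randrange(len(true_rates))
        else:
            action = max(range(len(true_rates)), key=lambda a: values[a])
        # Take the action and observe the reward (1 = goal reached, 0 = not).
        reward = 1 if rng.random() < true_rates[action] else 0
        # Learning step: move the estimate toward the observed reward.
        pulls[action] += 1
        values[action] += (reward - values[action]) / pulls[action]
    return pulls, values

# Hypothetical conversion rates of two variants, unknown to the agent.
pulls, values = run_agent([0.05, 0.20])
print(pulls, [round(v, 3) for v in values])
```

The occasional random action is what keeps the internal model improving for every variant, instead of locking onto whichever one looked good first.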

Reference: http://conductrics.com/intelligent-agents-ab-testing-user-targeting-and-predictive-analytics/

Tools

A tool is just a means to an end: choose the best tool for the job at hand.

Excel is the notepad for data: great for a quick look, a quick chart, a prototype, or the final UI for an ad-hoc analysis, and mostly used in the EDA phase. It handles small dashboards and reports and some data manipulation well, but beyond a certain size or complexity it becomes too time consuming and it is better to move on to the next tool.

Python is great for data ETL and data automation, and handles building a data product of increasing complexity and size (as opposed to Excel), but it is also not bad for EDA, especially with IPython. It can run a fully automated dashboard that creates an HTML report, emails it to people and runs on a server by itself with very low maintenance. It allows reproducible data analysis. It might lack the very latest, most complex machine learning algorithms.

R is the most complete statistical and machine learning algorithm collection available (?). It is better for running algorithms that do one specific thing, and not as good for building complex automation and logic on top (compared to Python).

Web Analytics Dictionary

2014-03-02T00:00:00+00:00

Some time ago I found http://www.webanalyticsdictionary.com/, which I like a lot as a source of web analytics definitions.

There is no in-site search mechanism, which I found frustrating, but it is possible to use Google search instead:

"Acknowledgement Page" site:www.webanalyticsdictionary.com

(as long as what you search for is already indexed by the google search engine)

Anyway, I decided to scrape it for my own reference, and I am posting it here in case someone else finds it useful too.

(It could even be loaded into a spaced repetition program like Anki to help learn all the terms, or to prepare for job interviews.)

Web Scraping it

Step 1. Download the Internet to your computer (or, in this case, just the site)

wget -m -p -E -k -K -np http://www.webanalyticsdictionary.com/

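“-m” mirror the site.
“-p” download all the files needed to display each page (images, CSS, etc.).
“-E” add the .html extension to HTML files that lack it.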
“-k” convert the links in the document to make them suitable for local viewing.
“-K” When converting a file, back up the original version with a ‘.orig’ suffix.
“-np” Do not ever ascend to the parent directory when retrieving recursively.

Step 2. Extract the data

Step one has made a local copy of the site. Now we need to extract the useful data from it:

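import csv
import fnmatch
import os

from bs4 import BeautifulSoup

results = {}

# Walk the mirrored site and parse every index.html page.
for root, dirs, files in os.walk("."):
    for filename in fnmatch.filter(files, "index.html"):
        path = os.path.join(root, filename)
        with open(path, encoding="utf-8", errors="ignore") as f:
            soup = BeautifulSoup(f.read(), "html.parser")
        title = soup.select('.entry-title')
        content = soup.select('.entry-content')
        if title and content:
            # Drop non-ASCII characters and flatten newlines.
            text = content[0].get_text().encode('ascii', 'ignore').decode('ascii')
            results[title[0].get_text().strip()] = text.replace('\n', ' ')

# csv.writer takes care of quoting commas and quotes inside the definitions.
with open('webanalyticsdictionary.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    for key, value in results.items():
        writer.writerow([key, value])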

I’m using the Anaconda Python all-in-one installation package to install both Python and the required libraries.

Web Analytics Dictionary

Keyword	What is
% New Visits	Of all the visits to your site, what percentage of them came to your site for the first time. By itself, this is nearly useless. But, when paired with other stats it can be illuminating. For example, let's compare the number of new visitors to the number of page views. If your new visitor count is high and continues to grow, but your page views remain constant, that would tell you that while you are attracting visitors, they are not coming back.
(PIE) Persistent Identification Element	is a type of tag that is attached to a user's browser, providing a unique ID for that visitor. A PIE is not unlike a cookie. Though that kind of thinking will never assure your progress as a pastry chef.
1 st Party Cookies	See First-party Cookies
112.2o7.net	The domain of the SiteCatalyst cookie that is set for report suites that target the Omniture Data Center in San Jose, California
122.2o7.net	The domain of the SiteCatalyst cookie that is set for report suites that target the Omniture Data Center in Dallas, Texas.
3 rd Party Cookies	See Third-party Cookies
A/B Testing	This includes testing different variations of a particular page. A/B testing proves to be especially helpful when comparing the overall design or single elements.
abandonment rate	The number of abandoned shopping carts vs. the number of completed transactions.
Abandonment	The number of customers who drop off during the process of conversion, like a half filled form or incomplete purchase.
ABC analysis	An approach for classifying accounts based on their attractiveness. A accounts are the most attractive while C accounts are the least attractive.
ABC inventory classification	A classification scheme used to implement inventory management strategies. Products are segmented into groups based upon unit sales or some other criterion. (For example, class A might be items with the highest frequency of sales, etc.) Inventory management is then guided by this segmentation.
Absolute URL’s Link	Absolute URLs use the full-path address, such as http://www.domain.com/page1.htm. (See also Relative URLs link.) Read more: http://www.sempo.org/?page=glossary#ixzz1ZyryDNXW
ACC	Average cost per click. The cost of a marketing campaign divided by the number of clicks generated. This metric is used to measure the effectiveness of your marketing campaigns. The campaign cost is taken from your campaign settings or is provided by search engines.
Account Activity Tool	SiteCatalyst tool that enables you to view server calls per report suite.
Account Manager	An Account Manager is an Omniture support representative that offers help in answering questions and/or resolving issues for supported users with a support plan that offers them a dedicated Account Manager.
Account Rollup	See Rollup.
Accounts Authentication Service	A Google service that processes requests for access and issues authentication tokens that your client can then send to a particular Google service, such as the Custom Search Service. To learn more, see Google Data APIs Authentication Overview.
Accuracy	The ability of a measurement to match the actual value of the quantity being measured. Accuracy is the foundation upon which your marketing analytics should be built. If you can't trust that your data is accurate, you can't make confident decisions. In statistical terms, accuracy is the width of the confidence interval for a desired confidence level. See also Unique Visitors.
Acknowledgement Page	A page displayed after a visitor completes an action or transaction. For example, a thank-you or a receipt page. Acknowledgement pages are often important in Scenario Analysis, where it is an indicator of a completed scenario.
Acquisition Strategy	A process of finding those potential customers who are in the market and ready to buy. The attempt to lead customers to a web site and to welcome them, answer their questions and close the sale. Read more: http://www.sempo.org/?page=glossary#ixzz1Zys78U66
Acquisition	Acquisition refers to how successful you are at getting these visitors.
ACT	After-Click Tracking is the recording of the activity path of a visitor to a site after they have clicked on an email link.
action participation	Actions completed on any of the pages viewed during a visit. Whenever an action is completed on a target page, the action is counted for all pages viewed during the visit that contained the target page. Similarly, in case of document groups, action participation shows the number of actions completed on any of the pages of a document group. Whenever an action is completed on a target page, it is counted in all the document groups that contain the target page.
action	Actions performed by your visitors. If your actions are set to unique, then the Action metric shows the number of unique actions (e.g., if a visitor makes three purchases during a visit, only one unique action is counted). Otherwise, the action metric shows the total number of (non-unique) actions (e.g., if an action occurs more than once during a visit, it will be counted every time it occurs).
Actionable Data	Information that allows you to make a decision or can be made use of in any way.
Active Time / Engagement Time	Average amount of time that visitors spend actually interacting with content on a web page, based on mouse moves, clicks, hovers and scrolls. Unlike Session Duration and Page View Duration / Time on Page, this metric can accurately measure the length of engagement in the final page view.
Ad / Advert	A link that takes a visitor to a website when clicked on, usually graphic or text. See also Banner Ad.
Ad Campaigns	Integrate data from your banner and online Adwords with your site analytics data Discover the true worth of every potential customer that comes to your site by going beyond CPC At Outsource2india, we are vendor neutral. We have not restricted ourselves to one online Ad platform We can provide you with the competitive analysis that you require, whether you advertise with MSN, Google or Yahoo
Ad Click	A click on an advertisement on a website which takes a user to another site. See also Ad View.
Ad Groups	A group of keywords within a campaign in Google.
Ad Hoc Query	A non-standard inquiry posed to a database of information as the need arises. See also Query.
Ad View	A webpage that presents an ad. There may be more than one ad on an ad view. Once visitors have viewed an ad, they can click on it.
Ad-hoc query	A search request that retrieves information without knowledge of the underlying storage structures of the… Aggregator: See feed reader.
add to cart	Number of times your visitors chose to add an item to their shopping cart.
Address Book	A software application that stores and manages personal contact information. Typically built into email packages.
Address	The unique location of a specific webpage. See also URL.
Admin Level	Google Analytics has two basic levels of access: View Reports Only and Account Administrator. Users with View Reports Only access can view their Profiles reports and view and edit their own language preferences. All Account Administrators have
ADSL	Asymmetric Digital Subscriber Line. High-speed, always-on internet access via standard domestic telephone lines, usually faster in one direction. See also DSL.
Advanced Analysis	Generic industry term relating to the advanced analysis of your web site traffic data. With regard to SiteCatalyst, it is the tab that contains advanced analysis features, including Data Warehouse, ASI, Discover, and Multivariate Testing.
Advanced Analytics is suitable for	Websites with a lot of traffic, registration paths and diverse referrers and keywords.
Advanced Segment Insight (ASI)	ASI stands for Advanced Segment Insight. This tool is used to segment data retrospectively from the Data Warehouse and create the new custom segment in a SiteCatalyst Report Suite.
Advantages of logfile analysis	The main advantages of logfile analysis over page tagging are as follows: The web server normally already produces logfiles, so the raw data is already available. No changes to the website are required. The data is on the company's own servers, and is in a standard, rather than a proprietary, format. This makes it easy for a company to switch programs later, use several different programs, and analyze historical data with a new program. Logfiles contain information on visits from search engine spiders, which generally do not execute JavaScript on a page and are therefore not recorded by page tagging. Although these should not be reported as part of the human activity, it is useful information for search engine optimization. Logfiles require no additional DNS lookups. Thus there are no external server calls which can slow page load speeds, or result in uncounted page views. The web server reliably records every transaction it makes, including e.g. serving PDF documents and content generated by scripts, and does not rely on the visitors' browsers co-operating.
Advantages of page tagging	The main advantages of page tagging over logfile analysis are as follows: Counting is activated by opening the page (given that the web client runs the tag scripts), not requesting it from the server. If a page is cached, it will not be counted by the server. Cached pages can account for up to one-third of all pageviews. Not counting cached pages seriously skews many site metrics. It is for this reason server-based log analysis is not considered suitable for analysis of human activity on websites. Data is gathered via a component (tag) in the page, usually written in JavaScript, though Java can be used, and increasingly Flash is used. JQuery and AJAX can also be used in conjunction with a server-side scripting language (such as PHP) to manipulate and (usually) store it in a database, basically enabling complete control over how the data is represented. The script may have access to additional information on the web client or on the user, not sent in the query, such as visitors' screen sizes and the price of the goods they purchased. Page tagging can report on events which do not involve a request to the web server, such as interactions within Flash movies, partial form completion, mouse events such as onClick, onMouseOver, onFocus, onBlur etc. The page tagging service manages the process of assigning cookies to visitors; with logfile analysis, the server has to be configured to do this. Page tagging is available to companies who do not have access to their own web servers. Lately page tagging has become a standard in web analytics.
Adware	Software code on a computer that transmits or receives data to or from a remote host for the primary commercial purpose of delivering advertising content or performing other advertising functions.
Affiliate Marketing / Affiliate Programme	A method of promoting web businesses in which an affiliate is rewarded for providing customers. Compensation could be made based on a value for visits, subscriptions, leads, sales, and so on. See also PPC.
Affiliate Marketing	A method of marketing where other websites can sign up to sell your products for a commission.
Aggregate Data	A summary of collected information which groups data together without individual-level statistics.
Aggregrate	Total site traffic for a defined period of time
Ajax Search API	A programming interface that lets you put Google search and Custom Search in your webpages. You can use JavaScript to embed search results in innovative, programmatic ways. To learn more, see the Google Ajax Search API.
AJAX	Asynchronous JavaScript and XML. Rich media technology for dynamically changing content on web pages based on user action, page action, etc.
Alert	Email notification that is sent when a specified metric exceeds a predefined threshold.
Algorithm	A mathematical formula used by search engines to determine which websites to present, in which order, in their search results. See also Search Engine.
All Search Engines	SiteCatalyst report that shows the traffic sent to you by both paid and natural search engines.
Allocation	Allocation metrics are visit-based metrics in which the pages that visitors use result in success events, e.g. revenue or cart adds. For example, a visitor navigates through five pages of your site, and the visit results in a purchase of a $10,000 item. The allocation metric gives partial credit of the $10,000 to the five pages, so that each page receives $2,000 credit. If conversion is enabled, then allocation for pages is also automatically enabled. The linear allocation option for eVars follows this method.
Analogue Signal	A continuously-modulated waveform, such as that produced by a human voice or musical instrument. See also Digital Signal.
Analysis techniques	There are different frameworks to get analytics working for you, like accessing the server logs and deciphering the information, packet sniffing which sniffs incoming and outgoing data packets and reads the visitor information, image based tagging which uses a query string and runs data collection scripts which analyses user requests sent to the servers. All these techniques are like double edged swords; they cut through useful visitor information and also cuts-off useful visitor information. This is why setting analytical goals is very important or rather the only option to achieve effectiveness. But do not worry; we do have some pretty decent tools from vendors, who understand the nuances better. Let us now look at some of the tools and options available, mostly commercial.
Analysis	Studying the data on traffic to the website, visitor behavior and visitor responses to draw meaningful conclusions from these data. This analysis should help you in getting more business to your website.
Analytics	A Google product that provides detailed statistical reports about traffic to your website. To learn more, see the Google Analytics site.
Anchor	HTML tag < a href > that references a specific link contained within the content of a page.
annotation	A unit of information in the annotations file. An annotation comprises a site in your search engine and its associated labels. You can define annotations in the Sites tab of the control panel or in the annotations file. To learn more, see The Basics and Selecting Sites to Search.
annotations file	An XML or TSV file that lists sites (webpages or websites) you want your search engine to cover. You can think of it as the index of your search engine that contains information about which sites should be covered and how they should be ranked in the results. To learn more about creating the file, see The Basics and Selecting Sites to Search.
Anti-Virus Programs	Updatable software that scans and monitors your computer's memory for viruses, usually includes removal and recovery tools.
Apache	Apache is a free, open-source web server software system that is pervasive on UNIX, Linux, and similar operating system types. It is also available for Windows and other operating systems. Google Analytics admin system is powered by a variant of Apache. For more information, see Apache.org .
API	Application Programming Interface is a system that a computer or application supplies in order to allow requests for service to be made of it by other computer programs. APIs allow data to be exchanged between computer programs, and a standard software API method includes Open Database Connectivity
Applet	A small program that can be embedded in an html page, for a specific purpose, for example to communicate with a host server.
Application programming interface (API)	Functions of a computer program that can be accessed and used by other programs. For…
Application	Any computer program, such as a word processor, email client, spreadsheet, database etc.
Arbitrage	A practice through which web publishers second tier search engines, directories and vertical search engines engage in the buying and reselling of web traffic. Typically, arbitrage occurs when such publishers pool client budgets to engage in PPC campaigns on Tier I search engines (Google, Yahoo!, MSN). If the publishers pay $0.10 per click for traffic, they typically resell those visitors to clients who bid $0.20 or more for the same keywords. Successful arbitrage requires that the arbitrageur must pay less per click than what the traffic sells for. The variation called Affiliate Arbitrage involves a web site owner or blogger bidding on keywords from programs such as Yahoo! Search Marketing or Google AdWords, who then links the ads, either to their own web site, or directly to a merchant site displaying ads (from programs such as the Yahoo! Publisher Network or Google AdSense).
area code	The 3-digit telephone prefix codes of your visitors coming from the U.S. and Canada.
ARPANET	Advanced Research Projects Administration Network. A communications network developed in the 1960s by the US defence department as a network which could survive a nuclear attack and later evolved into the Internet.
ASCII	American Standard Code for Information Interchange. The worldwide standard for the codes used by computer systems to represent all the upper and lower-case Latin letters, numbers, punctuation, etc. There are 128 standard ASCII codes each represented by a 7 digit binary number in the range 0000000 to 1111111.
ASI Slot	The ASI slot holds the segmented Advanced Segment Insight Data. It works much like a report suite, which regularly holds collected web analysis data.
ASP (Application Service Providers)	Third-party providers of software-based services and solutions. Distribution is made from one central data center across a wide area network. ASPs are a common tool companies use to outsource their various information technology needs.
ASP	Active Server Pages are a set of software components that run on a web server and let developers build dynamic webpages.
ASP.NET	A development framework created by Microsoft for creating dynamic web applications and web services. It… ASPNET.MDF: A database file generated by ASP.NET to store user membership and profile information.
assist	A metric used to measure which campaigns (of any type) most often indirectly contribute to the conversions you value in order to better evaluate marketing opportunities. The metric is calculated based on the number of times a campaign drives a visitor to your website prior to the source credited with a conversion. Assists can be aggregated at higher level categories or viewed for each individual campaign being tracked. All assists occurring within 45 days of the conversion event will be recorded. Note also that a single marketing activity can assist multiple conversions. Please also note that individual keywords or other organic traffic sources are not tracked as assists.
Asymmetric encryption	An encryption method that uses different keys for encrypting and decrypting the data. Each party… Asynchronous JavaScript and XML: see AJAX. Atom: A web feed standard based on XML.
Attachment	A file sent together with a message, usually via email.
Attractive Statistics	Sawmill's statistics are attractive. The tables are colored for easy reading, and the graphs are designed to be easily readable. You'll be able to take the reports right out of Sawmill and show them to your boss, or your investors, or anyone else, without having to reformat them to make them look good; they already look good.
attribute	In XML, a property of an element. An attribute is made up of a name and a value (< element attribute_name=”value” >). To learn more about XML and attributes, see The Basics.
Attrition	The erosion of your customer base over time. The opposite of customer retention.
Auction Model Bidding	The most popular type of PPC bidding. First, an advertiser determines what maximum amount per click they are willing to spend for a keyword. If there is no competition for that keyword, the advertiser pays their bid, or less, for every click. If
Authentication	Technique by which access to Internet or intranet resources requires the user to enter a username and password.
Authorization	A security-related process that decides the privileges of a human user, process, or program that…
autocompletion	List of search queries that appear in your search box as users type. For example, as users start to type the first few characters of “google” in the search box, the search box automatically generates possible queries such as “google maps”, “google
Automated Reconciliations	The process of passing data between electronic systems so that the completeness and integrity of the data is verified during the transfer.
Automatic Optimization	Search engines identify which ad for an individual advertiser demonstrates the highest CTR (click-through rate) as time progresses, and then optimize ad serving by showing that ad more often than other ads in the same Ad Group/Ad Order.
Automatic PageNames Plug-in	Omniture plug-in that will dynamically set a page name based on the existing URL folder structure. The plug-in is also configurable to include/exclude query string parameters.
Autoresponder	A software facility which can handle electronic requests for brochures, product information etc. and automatically generates an email response.
Avatar	A graphical image of a user, such as used in graphical real-time Chat applications, or a graphical personification of a computer or a computer process, intended to make the computing or network environment a friendlier place.
Average Lifetime Value	The average of the lifetime value of a visitor or multiple visitors during the reporting period, where each visitor's lifetime value is the total monetary value of that visitor's past orders since visitor tracking began.
average order value (unique action)	Average value of a purchasing order for unique sale actions. Calculated by dividing the revenue amount (revenue) by the number of unique sale actions (unique actions).
average order value	Average value of a purchasing order for sale actions. Calculated by dividing the revenue amount (revenue) by the number of sale actions (action). If your sale actions are set to unique, then this metric is calculated by dividing the revenue amount by the number of unique sale actions. Otherwise, this metric is calculated by dividing the revenue amount (revenue) by the number of total (non-unique) sale actions.
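The calculation described in the entry above can be sketched as follows. This is a minimal illustration, not any vendor's implementation; the function name, parameter names, and sample figures are assumptions made for the example.

```python
def average_order_value(revenue, sale_actions, unique_sale_actions, use_unique=False):
    """Divide revenue by the number of sale actions.

    When use_unique is True, the unique sale action count is the divisor,
    mirroring the "sale actions set to unique" case in the definition.
    All names here are hypothetical.
    """
    count = unique_sale_actions if use_unique else sale_actions
    if count == 0:
        return 0.0  # no sales recorded, so there is no meaningful average
    return revenue / count


print(average_order_value(1200.0, 30, 24))        # 40.0
print(average_order_value(1200.0, 30, 24, True))  # 50.0
```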
Average Page Depth	The average number of pages on a site that visitors view during a single session.
Average Page Views per Visitor	The number of pages each visitor looks at on average.
Average Position	In SearchCenter, the average position of a search term in the search results over a period of time.
average product order value	Average order value of the products purchased. Calculated by dividing the total product revenue (product revenue) by the number of carts completed (cart complete).
Average Response Value	The average revenue value of each click, calculated as total revenue divided by total clicks.
average time on page	Average time your visitors spent on your web pages. Calculated by dividing the total time your visitors spent on your website by the number of page views.
Average Time On Site (AToS)	The average amount of time a user spends on a website.
average time per visit	Average duration of a visit. Calculated by dividing the total time your visitors spent on your website by the number of visits.
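The two time-based averages defined above share a numerator (total time spent on the site) and differ only in the divisor. A small sketch, with hypothetical function names and made-up sample totals:

```python
def average_time_on_page(total_seconds, page_views):
    """Total time on the website divided by the number of page views."""
    return total_seconds / page_views if page_views else 0.0


def average_time_per_visit(total_seconds, visits):
    """Total time on the website divided by the number of visits."""
    return total_seconds / visits if visits else 0.0


# Assumed sample data: 36,000 seconds spent across 1,200 page views in 400 visits.
print(average_time_on_page(36000, 1200))   # 30.0
print(average_time_per_visit(36000, 400))  # 90.0
```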
Average Time Spent on Site	Average session length per visit during the time frame in question.
AVS	Address Verification Service. A fraud-prevention service used to verify that the billing address and post code supplied by an online customer match the address registered with the card-issuing bank.
B2B	Business To Business. A business model for online trade between businesses. See also B2C and C2C.
B2C	Business To Consumer. A business model for online trade between organisations and individuals. See also B2B and C2C.
Back Button	Refers to the browser button that, when clicked, returns the user to the last page visited.
Backbone	A high-speed line or series of connections that forms a major pathway within a network, much as a motorway system comprises the major pathways in a road transport network.
background label	See search engine label.
Banner Ad	A banner is a standard-format advertisement placed on a website either above, below or on the side of the main content, usually linked to the advertiser's website. Banner ads may contain text, animated graphics, video and sound. The Interactive Advertising Bureau (IAB) has codified a standard set of banner ad sizes (Medium Rectangle, Rectangle, Leaderboard, Wide Skyscraper) in a set of guidelines called the Universal Ad Package.
Base Report Type	The basic conversion statistics that are tracked on your site, such as products, campaigns or visit depths. Base Report Types appear in Products reports as the main category under which the graphed items fall.
Basic Code	Term used when referring to the SiteCatalyst code that is placed on the web page in order to enable Omniture to track the page.
Basic Subrelations	Enables some of the available conversion reports to be subrelated in SiteCatalyst. See Full Subrelations or Subrelations.
Batch	A group of transactions that have been captured and are waiting for processing.
BCC	Blind Carbon Copy. Facility within an email or messaging application to send multiple copies of a message, without revealing the list of addresses to each recipient.
Behavioral Targeting	The practice of targeting and serving ads to groups of people who exhibit similarities not only in their location, gender or age, but also in how they act and react in their online environment. Behaviors tracked and targeted include web site topic areas they frequently visit or subscribe to; subjects or content or shopping categories for which they have registered, profiled themselves or requested automatic updates and information, etc.
Benchmark	A standard by which something can be measured or judged. For example, Webtrends Analytics 8 helps you benchmark your Key Performance Indicators to ensure everyone in your organization is measuring performance against the same goals.
Benchmarking in Ecommerce	Discusses the issue of benchmarking, what benchmarks are available, how reliable they are, and what to do about it. Benchmarks for Websites
Best Practices Consulting	A premium service available to clients seeking assistance in maximizing their online investments. As seasoned web analytics professionals, the Best Practices consultants assess clients' strategic objectives, and deliver business analysis, and
Bid Boosting	A form of automated bid management that allows you to increase your bids when ads are served to someone whose age or gender matches your target market. This level of demographic focus and the bid boosting tool are current Microsoft adCenter offerings.
Bid Management Software	Software that manages PPC campaigns automatically, called either rules-based (with triggering rules or conditions set by the advertiser) or intelligent software (enacting real-time adjustments based on tracked conversions and competitor actions). Both types of automatic bid management programs monitor and change bid prices, pause campaigns, manage budget maximums, adjust multiple keyword bids based on CTR, position ranking and more.
Bid Management	Bid management enables you to manage the purchase of the keywords that are most important to you in a marketing campaign. You can monitor the availability and prices of keywords by setting keyword purchase rules for keywords that become available on a specific search engine.
Bid	The maximum amount of money that an advertiser is willing to pay each time a searcher clicks on an ad. Bid prices can vary widely depending on competition from other advertisers and keyword popularity.
BigDaddy	An update to Google's ranking algorithms for web sites that occurred in early 2006. It ...
Billing Address	The address to which a customer’s credit card billing statement is mailed, used to verify that an online customer is the actual cardholder.
Biometrics	Methods of identifying unique human characteristics for security purposes, for example fingerprinting, iris scanning and dynamic signature verification.
Bit	The smallest measure of digital data, represented by a single binary digit, 0 or 1.
Black Box Algorithms	Black box is technical jargon for when a system is viewed primarily in terms of its input and output characteristics. A black box algorithm is one where the user cannot see the inner workings of the algorithm. All search engine algorithms are hidden.
Blacklists	A list of Web sites that are considered off limits or dangerous. A Web site can be placed on a blacklist because it is a fraudulent operation or because it exploits browser vulnerabilities to send spyware and other unwanted software to the user.
Blog/Web Logs	A self-published, managed or maintained Web diary. Usually updated daily or weekly, blogs have historically been personal, but gained notoriety after the 2004 election as an influential media outlet. Companies now use blogs to extend their brand and improve their organic search visibility.
Blogs	A truncated form of web log. A blog is a frequently updated journal that is intended for general public consumption. Blogs usually represent the personality of the author or web site. A good source of blogging terms is www.whatis.techtarget.com.
Bluetooth	A local wireless communications standard that allows electronic devices to share information, for instance to synchronise address books among PCs, PDAs and some mobile phones.
Bookends Pattern	SiteCatalyst report that lets you analyze what happens before and after a selected page.
Bookmark	A means of storing and indexing information, typically used by browser applications to store visited web addresses.
Bookmarks	A saved report in SiteCatalyst for easy access and later viewing. Bookmarking a report saves time because the parameters are already set; you don't have to choose them over and over again.
boost search engine	A search engine that does not exclude results. Instead, it searches across the web and promotes sites that you have tagged with boost labels.
boost	A mode for a label. Sites tagged with boost labels are promoted in the search ranking. How much a site is promoted depends on the weight you assign to the label and the score you assign to the site. To learn more, see Changing the Ranking of Your Search Results.
Bot	See robot.
Bots	Technology or humans used to artificially inflate traffic data to defraud advertisers and web sites that provide venues for advertisers.
Bounce Rate	Bounce rate is the percentage of single-page visits, that is, visits in which the person left your site from the entrance (landing) page. Use this metric to measure visit quality: a high bounce rate generally indicates that site entrance pages aren't relevant to your visitors. The more compelling your landing pages, the more visitors will stay on your site and convert. You can minimize bounce rates by tailoring landing pages to each keyword and ad that you run. Landing pages should provide the information and services that were promised in the ad copy.
Bounce	See Bounce Rate and Email Bounce.
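As a rough illustration of the bounce rate definition, the sketch below treats each visit as a count of pages viewed and counts single-page visits as bounces. The data layout and function name are assumptions for the example, not how any particular analytics tool stores visits.

```python
def bounce_rate(pages_per_visit):
    """Return the percentage of visits that viewed exactly one page."""
    if not pages_per_visit:
        return 0.0
    bounces = sum(1 for pages in pages_per_visit if pages == 1)
    return 100 * bounces / len(pages_per_visit)


# Hypothetical sample: eight visits, four of which left from the landing page.
visits = [1, 3, 1, 5, 2, 1, 4, 1]
print(bounce_rate(visits))  # 50.0
```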
BPS	Bits Per Second. A measurement of how fast data is pushed through a communications link, for example a 56Kbps modem is designed to move data at 56,000 bits per second.
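To make the unit concrete, the sketch below estimates how long a file would take to move over a link of a given bit rate. It assumes an ideal link with no protocol overhead or compression, which real connections do not offer; the function name is hypothetical.

```python
def transfer_seconds(size_bytes, bits_per_second):
    """Ideal transfer time: convert bytes to bits, then divide by the link rate."""
    return size_bytes * 8 / bits_per_second


# One megabyte (1,048,576 bytes) over a 56 Kbps modem (56,000 bits per second).
seconds = transfer_seconds(1_048_576, 56_000)
print(f"{seconds:.1f} seconds")  # roughly two and a half minutes
```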
Branch	In Page Flow Reports, branches show you the next and previous pages you can track from your starting page.
Brand and Branding	“A brand is a customer experience represented by a collection of images and ideas; often, it refers to a symbol such as a name, logo, slogan, and design scheme. Brand recognition and other reactions are created by the accumulation of experiences with
Brand Messaging	Creative messaging that presents and maintains a consistent corporate image across all media channels, including search.
Brand Reputation	The position a company brand occupies.
Brand	Customer or user experience represented by images and ideas, often referring to a symbol (name, logo, symbols, fonts, colors), a slogan and a design scheme. Brand recognition and other reactions are created by the accumulation of experiences with the specific product or service, both from its use, and as influenced by advertising, design and media commentary. Brand is often developed to represent implicit values, ideas and even personality. Source: Wikipedia
Branding Strategy	The attempt to develop a strong brand reputation on the web to increase brand recognition and create a significant volume of impressions.
BrandLift	A measurable increase in consumer recall for a specific, branded company, product or service. For example, brand lift might show an increase in respondents who think of Dell for computers, or WalMart for every household thing.
Bread Crumb	In a web page, a link-based navigation tool that displays your location in the content hierarchy of a site.
Breadcrumb navigation	Navigational links appearing on a web page that show the path taken to reach that ...
Breakdown	Term used to define the action of integrating two correlated items or reports. See Correlation.
Bricks And Mortar	Of the physical world, as opposed to the online environment, usually refers to High Street retail outlets.
Bridge Page	Often used to describe the web pages that link together many doorway pages on a web site. See also: Doorway Page, Hallway Page.
Brochureware	Website publicity material that is basically an adaptation of existing marketing tools, and not optimised for online interaction.
Broken Link	An incorrect HTML hyperlink that does not deliver the user to the specified location.
browse rate	Average number of pages viewed by your visitors. Calculated by dividing the number of page views by the number of visits.
Browser	The software used to access a website. Examples of browsers include Internet Explorer, Firefox, and Netscape.
Browser Cache (pronounced CASH)	A browser is an application that provides a way to look at and interact with all the information on the World Wide Web (also called a Web browser). It is technically a client program (such as Microsoft Internet Explorer or Netscape Navigator) that uses Hypertext Transfer Protocol (HTTP) to send requests to Web servers throughout the Internet on behalf of the user. A cache is a temporary storage location from which data can be retrieved. Data retrieval from cache is usually much faster than from the server. The Internet makes predominant use of the following two caching mechanisms: (1) The Web browser stores recently downloaded pages in a temporary location on your hard drive. When you return to a page you’ve recently viewed, the browser quickly retrieves it from this local cache rather than the original server. You can usually change the size of your cache, depending on the browser you use. (2) Most ISPs use caching servers on their network. When a browser requests a page, the page is downloaded to the individual's computer and also saved on the caching server. If a different user then requests the same page, it is retrieved from one of the caching servers rather than the site's servers. Many analysis tools (especially log file analyzers) fail to count page views from cached pages. SiteCatalyst's proprietary cache-busting technology counts cached page views the same way it counts page views that have not been retrieved from cache, resulting in much more accurate Web site usage data.
Browser Height	Metric that refers to the vertical distance of the data in the browser window only. The toolbars, menus, buttons, etc. are all excluded from the browser height.
Browser Type	Metric that refers to the type of browser being used by the visitor; i.e. Internet Explorer, Mozilla, Firefox, etc.
browser version	The browser versions that your visitors use.
Browser Width	Metric that refers to the horizontal distance of the data in the browser window only.
Browser	Software application that can access and view information stored on websites and organised into HTML pages (Microsoft Internet Explorer for example).
Browsers	A browser, or more accurately, user agent, is the software used to access a website. Examples of user agents are Explorer (for Microsoft Internet Explorer), Netscape (for Netscape Navigator), and Googlebot (an automated robot that scours the web for website content to include in its search engine).
browsing hour	The local time when your visitors visit your website.
Bucket	An associative grouping for related concepts, keywords, behaviors and audience characteristics associated with your company's product or service. A virtual container of similar concepts used to develop PPC keywords, focus ad campaigns and target messages.
Business Intelligence	While some would claim it’s an oxymoron, business intelligence refers to a category of software and tools designed to gather, store, analyze, and deliver data in a user-friendly format to help organizations make more informed business decisions.
Business User	An individual that uses electronic systems for business or commercial purposes.
Buying Funnel	Also called the Buying Cycle, Buyer Decision Cycle and Sales Cycle, Buying Funnel refers to the multi-step process of a consumer's path to purchasing a product, from awareness to education to preferences and intent to final purchase.
Buzz Monitoring Services	Services that will email a client regarding their status in an industry. Most buzz or publicity monitoring services will email anytime a company's name, executives, products, services or other keyword-based information on them are mentioned on the web. Some services charge a fee; others, such as Yahoo! and Google Alerts, are free.
Buzz Opportunities	Topics popular in the media and with specific audiences that receive news coverage or pass along recommendations that help increase exposure for a brand. Ways to uncover potential buzz opportunities include reviewing incoming traffic to a web site from organic links and developing new keywords to reach those visitors, or scanning special interest blogs and social media sites to learn what new topics attract rising interest, also to develop new keywords and messages.
Bytes	A byte is a unit of information transferred over a network (or stored on a hard drive or in memory). Every web page, image, or other type of file is composed of some number of bytes. Large files, such as video clips, may be composed of millions of bytes (megabytes). Since website and server performance is heavily affected by the number of bytes transferred, and web hosting providers often charge according to this measure, it is important for site owners to understand it. One byte is equal to 8 bits, where each bit is either a one or a zero. Common terms incorporating the word byte are: Kilobyte (1,024 bytes), Megabyte (1,048,576 bytes), Gigabyte (1,073,741,824 bytes).
C2C	Consumer to Consumer. A business model in which consumers transact with other consumers, e.g. via eBay. See also B2B and B2C.
CA	Certification Authority. A trusted third-party organisation or company that issues digital certificates used to create digital signatures and public-private key pairs. The role of the CA in this process is to guarantee that the individual granted a unique certificate is, in fact, who he or she claims to be. Usually, the CA has an arrangement with a financial institution, such as a credit card company, which provides information to confirm an individual's claimed identity. CAs are crucial to data security and electronic commerce because they guarantee that parties exchanging information are genuine.
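The unit sizes listed above are successive powers of 1,024. The following sketch converts a raw byte count into the largest fitting unit using those factors; the function name and output format are illustrative assumptions.

```python
def format_bytes(n):
    """Express a byte count in bytes, KB, MB or GB using 1,024-based units."""
    units = ["bytes", "KB", "MB", "GB"]
    value = float(n)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


print(format_bytes(500))            # 500.0 bytes
print(format_bytes(1536))           # 1.5 KB
print(format_bytes(1_048_576))      # 1.0 MB
print(format_bytes(1_073_741_824))  # 1.0 GB
```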
Cable Modem	A device that enables broadband internet connections over cable TV networks.
Cache	A temporary storage area that a web browser or service provider uses to store common pages and graphics that have been recently opened. The cache enables the browser to quickly reload pages and images that were recently viewed.
caching (web caching)	A method used by web browsers and web proxy servers to store previous responses from web servers, such as web pages.
Calendar Tool	SiteCatalyst tool that enables you to select the timeframe for which data will be displayed in a selected report. You can select days, months, years, or a specified, granular time frame; i.e. May 3, 2005 to May 8, 2005.
CAM	Card Authentication Method. A method by which a plastic card is determined genuine and not counterfeit. Current chip cards provide the best CAM available.
Campaign Analysis	A Webtrends feature that tracks activity originating from a marketing campaign, so you can compare your campaigns and evaluate their effectiveness. Campaigns are tracked using a Webtrends query parameter on the marketing campaign landing page.
Campaign Classification Manager	SiteCatalyst tool used to create and manage classifications. For more information, refer to Classifications.
campaign clicks	Total number of clicks performed by your campaign visitors.
Campaign Container	Each region of the page that holds targeted content is a campaign container. The campaign container is usually a bounded region of the page (it may help to visualize this as similar to a banner ad, although it can take any format, including a sentence of text in the middle of a paragraph).
Campaign Integration	Planning and executing a paid search campaign concurrently with other marketing initiatives, online or offline, or both. More than simply launching simultaneous campaigns, true paid search integration takes all marketing initiatives into consideration prior to launch, such as consistent messaging and image, driving offline conversions, supporting brand awareness, increasing response rates and contributing to ROI business goals.
Campaign Option	The campaign container can display one or more campaign options selected by the targeting engine. A campaign option is an individual creative. It may be a single image, like a banner, or it can be a complex piece of HTML, including text, images, and multimedia content.
campaign Variable	The campaign variable identifies marketing campaigns used to bring visitors to your site. The value of campaign is usually taken from a query string parameter.
Campaign	A marketing effort used to bring visitors to a specific web site. Also, a product feature or advertised product concept. If a campaign option were advertising a credit card, the campaign would be a series of creatives advertising its interest rate. A second campaign would be a series of creatives to advertise any value added services that come with the card.
campaign	Your online marketing campaigns. Currently, you can schedule the following types of campaigns: paid search, banner campaigns, email campaigns, affiliate programs, and other.
Campaign-specific Metrics	Campaign-specific metrics are fixed numeric values associated with a campaign, such as the hard cost for a campaign.
Captcha	Completely Automated Public Turing test to tell Computers and Humans Apart. Webmasters employ this method to tell computers and humans apart by requiring the user to view a code and type it in an entry box. See also Webmaster.
Captured	The status of a transaction that has gone through Auth or PostAuth and is awaiting processing. See also Auth and PostAuth.
Card Issuer	A financial institution that provides cards via a contractual relationship with a cardholder.
Card Refund	Refund to a customer, created by return of goods or services paid for by a payment card.
Card Schemes	Organisations that manage and control the operation and clearing of card transactions, e.g. MasterCard, Visa, Switch, American Express, Diners Club International.
Card-Not-Present	A type of credit card transaction that does not require the cardholder and card to be physically present. All online transactions are card-not-present transactions.
Card-Present	A type of credit card transaction in which the cardholder and the card are physically present, for example in a shop. Typically, the credit card passes through a card reader and the cardholder enters their PIN or signs a form.
Carrier	Organisation that provides delivery services direct to the consumer.
Cart Additions	Refers to additions to the shopping cart being used by the visitor when making purchases on a web site.
cart complete	Number of completed carts for a product. Cart Complete is counted as a unique action, i.e., only one completed cart is counted per visit, although more product carts might have been completed during the visit.
Cart Open	Events in which the cart object is created for a customer. Cart opens usually occur when a customer selects an item for purchase, but can occur without an item as well.
Cart Removal	Refers to removals from the shopping cart being used by the visitor when making purchases on a web site.
Cart Views	Event in which the contents of the shopping cart are viewed by the customer.
Cart	See Shopping Cart
Cascading Style Sheets or CSS	An addition to your HTML, a web site's cascading style sheet contains information on paragraph layout, font sizes, colors, etc. A cascading style sheet has many uses as far as search engine optimization and web site design are concerned.
Categories	Similar groups of products, such as electronics or music (much like channels or content groups in the Traffic Reports). Categories can contain multiple products, for example, the Electronics Category can contain televisions, radios, and computers.
CC	Carbon Copy. Used to send a copy of an email message to more than one recipient.
Certificate Signing Request (CSR)	A text file generated by the web server on which the SSL certificate will be installed. A Certificate Authority uses this text file to create a “signed” SSL certificate.
CGI Script	A CGI script is a program written in one of several popular languages such as Perl, PHP, Python, etc., that can take input from a web page, do something with the data, and produce a customized result (among many other possible uses). CGI scripts are widely used to add dynamic behavior to websites and to process forms.
CGI	Common Gateway Interface. Software protocols used by a web server to communicate with other applications hosted on the same machine.
Change Controls	Controls that define which people have the ability to make changes to the data being stored and the programs being used.
Channel	Defined sections, or categories, of your site. Web sites that have two main categories, such as weather and news, have two channels. SiteCatalyst enables you to group statistics for all page views that occur within any channel in your site.
Chargeback	The reversal of a previously Settled transaction in which the merchant bank debits the amount of the sale from the merchant's account because the cardholder has disputed the charge.
Chat Program	An application that allows real-time text messaging between two or more users connected over a network.
Chat Room	A virtual location, hosted on an internet server, where users can join a room and send real-time text messages to other visitors in the same room.
Chip card	A plastic payment card containing a microchip which has memory and processing capabilities, recognisable by a gold contact plate that allows communication with a card reader. Also known as integrated circuit cards (ICCs) or smart cards.
Churn	The tendency for subscribers to switch between service providers, whether prompted by costs, quality of service levels or suitability of subscription packages.
CIFAS	CIFAS is the UK's Fraud Prevention Service. They can be found at www.cifas.org.uk.
city	The cities your visitors access your website from.
Click analytics	Click analytics is a special type of web analytics that gives special attention to clicks. Commonly, click analytics focuses on on-site analytics. An editor of a web site uses click analytics to determine the performance of his or her particular site with regard to where the users of the site are clicking. Click analytics may happen in real time or after the fact, depending on the type of information sought. Typically, front-page editors on high-traffic news media sites will want to monitor their pages in real time to optimize the content. Editors, designers or other stakeholders may analyze clicks over a wider time frame to help them assess the performance of writers, design elements, advertisements, etc. Data about clicks may be gathered in at least two ways. Ideally, a click is logged when it occurs, which requires some functionality that picks up relevant information when the event occurs. Alternatively, one may assume that a page view is the result of a click, and therefore log a simulated click that led to that page view.
Click Bot	A program generally used to artificially click on paid listings within the engines in order to artificially inflate click amounts.
Click Fraud	A type of internet crime that occurs in pay per click online advertising when a person, automated script, or computer program imitates a legitimate user of a web browser clicking on an ad, for the purpose of generating a charge per click without having actual interest in the target of the ad's link.
Click path	The sequence of hyperlinks that one or more website visitors follow on a given site.
Click Through Rate (CTR)	The percentage of known impressions that result in clicks.
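Expressed as arithmetic, the click-through rate is clicks divided by impressions, scaled to a percentage. A minimal sketch with an assumed function name and invented sample counts:

```python
def click_through_rate(clicks, impressions):
    """Return the percentage of impressions that resulted in a click."""
    if impressions == 0:
        return 0.0  # no impressions were served, so no rate can be computed
    return 100 * clicks / impressions


# Hypothetical ad: 45 clicks from 1,200 known impressions.
print(click_through_rate(45, 1200))  # 3.75
```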
Click Through	When a user clicks on a hypertext link and is taken to the destination of that link.
Click throughs	The process of a visitor clicking on a particular link, be it to move to another website, another page of the same website, or even the same page.
Click	Refers to a single instance of a user following a hyperlink from one page in a site to another. A growing community of web site editors uses click analytics to analyze their web sites.
Click-And-Mortar	Click-and-mortar describes a store that exists both online and in the physical world, for example John Lewis.
Clickstream	An alternative term for click path: the recording of what a computer user clicks on while Web browsing or using another software application. As the user clicks anywhere in the webpage or application, the action is logged on a client or inside the Web server, as well as possibly the Web browser, routers, or ad servers.
Client Error	An error that occurs because of an invalid request by the visitor's browser. See also return code.
Client	The browser used by a website visitor.
client-side scripting	Type of computer program that is executed by the client, or web browser, instead of the server. It enables webpages to be dynamic, that is, able to interact with the input of the users or some variable (such as locale of the user or the version of the browser). For example, a dynamic webpage could change its contents (say, a schedule of interesting local events) based on the entries the user has submitted on a membership form. You can write client-side scripting using programming languages such as JavaScript or VBScript.
Client-side Tracking	Client-side tracking entails the process of tagging every page that requires tracking on the Web site with a block of JavaScript code. This method is cookie based (available as first or third party cookies) and is readily available to companies who do not own or manage their own servers.
Cloaking	The process by which a web site can display different versions of a web page under different circumstances. It is primarily used to show an optimized or a content-rich page to the search engines and a different page to humans. Most major search
CNAME	A Canonical NAME record makes one domain name an alias of another.
COA	Acronym for Cost of Acquisition, which is how much it costs to acquire a conversion (desired action), such as a sale.
Code	Anything written in a language intended for computers to interpret.
color coding	Color coding is a feature, available in the reporting interface, that lets you set automatic color-coding for displayed metric values. The color-coding is set up based on specified thresholds.
color palettes	The different color palettes active on your visitors' computers.
comment	Text associated with a label. Comments are for your benefit; your users can't see them.
Commerce Server	Software that manages the main functions of an online store, such as product display, online ordering, and inventory management. The software works in conjunction with online systems to process card payments.
Competitive Analysis	As used in SEO, CA is the assessment and analysis of strengths and weaknesses of competing web sites, including identifying traffic patterns, major traffic sources, and keyword selection.
Complete Path	Set of SiteCatalyst pathing reports that enable you to view such metrics as path length, longest path, full path, et cetera.
Compression Programs	Freeware or shareware applications that are able to read and reorganise one or more data files into a more compact form, usually recognised by a .zip or .rar suffix. To reduce connect time, downloadable applications and large file transfers are often compressed in this way.
connection type	Type of connection your visitors use to access your website.
Connection Types	SiteCatalyst report that displays metrics for Internet connection speed; i.e. modem, LAN, et cetera.
Consumer Generated Media (CGM)	Refers to posts made by consumers to support or oppose products, web sites, or companies, which are very powerful when it comes to company image. It can reach a large audience and, therefore, may change your business overnight.
Consumer	An individual who purchases products and/or services for private use.
Contact Name	This is the real name (generally speaking) of the user to whom you have given access to a particular Google Analytics report. The contact name can have spaces in it, and it is not case-sensitive.
Contact	The basic aim of the web analytics dictionary website is to give its visitors a handy source of web analytics terms and definitions, all in one place. We have tried to make it as comprehensive as possible. To that end, we have compiled the definitions and terms from various online and offline sources. Please feel free to comment on the site or suggest any terms we may have missed. Contact us via email at info@webalanyticsdictionary.com
Content (A/B) Testing	Testing the relative effectiveness of multiple versions of the same advertisement, or other content, in referring visitors to a site. Multiple versions of content can be uniquely identified by using a utm_content variable in the URL tag.
Content (Campaign Tracking)	Content is the label for each version of an advertisement. The UTM variable for content, utm_content, indicates which version of a link the visitor clicked on to reach a web site; for example, utm_content=graphic_version1a.
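As a rough illustration of how the utm_content label can be read back out of a tagged link, here is a minimal Python sketch using the standard library. The helper name `get_utm_content` and the example URL are assumptions made for this example, not part of any analytics product.

```python
from urllib.parse import urlparse, parse_qs


def get_utm_content(url):
    """Return the utm_content value of a tagged URL, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("utm_content")
    return values[0] if values else None


# Hypothetical tagged landing-page URL (illustrative only).
tagged_url = (
    "http://www.example.com/"
    "?utm_source=newsletter&utm_content=graphic_version1a"
)
print(get_utm_content(tagged_url))
```

Grouping visits by this value is what lets two versions of the same ad be compared.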
Content Management Systems (CMS)	In computing, a content management system (CMS) is a document centric collaborative application for managing documents and other content. A CMS is often a web application and often it is used as a method of managing web sites and web content. The market for content management systems remains fragmented, with many open source and proprietary solutions available. Source: Wikipedia.org
Content Network	Also called Contextual Networks, content networks include Google and Yahoo! Contextual Search networks that serve paid search ads triggered by keywords related to the page content a user is viewing.
Content Provider	An organisation or individual that devises content for an online enterprise, in the form of webpages, ads, audio, video, newswires, listings etc. The content may be available free of charge or for a fee.
Content Targeting	An ad serving process in Google and Yahoo! that displays keyword triggered ads related to the content or subject (context) of the web site a user is viewing. Contrast to search network serves, in which an ad is displayed when a user types a keyword into the search box of a search engine or one of its partner sites.
Content	One of the five dimensions of campaign tracking; the other four are source, medium, campaign, and term.
Content-targeted advertising	An advertising model in which the publisher displays related advertising and content together.
context file	An XML file that defines the specifications of your search engine. It acts as the control center, houses labels, and specifies the global settings for a search engine. To learn more about creating the file, see The Basics and Defining Your Search Engine Specifications.
context	Defines the infrastructure of your search engine, such as how the results page should look, what the labels are, how pages should be ranked, and so on. You define the context of a search engine in the Basics tab of the control panel or in the context file. See also annotations. To learn more about context, see The Basics and Selecting Sites to Search.
Contextual Advertising	Advertising that is automatically served or placed on a web page based on the page's content, keywords and phrases. Contrast to a SERP (search engine result page) ad display. For example, contextual ads for digital cameras would be shown on a page with an article about photography, not because the user entered "digital cameras" in a search box.
Contextual Distribution	The marketing decision to display search ads on certain publisher sites across the web instead of, or in addition to, placing PPC ads on search networks.
Contextual Link Ads/Inventory	To supplement their business models, certain text-link advertising networks (like Google) have expanded their network distribution to include contextual inventory. Most vendors of search engine traffic have expanded the definition of Search Engine Marketing to include this contextual inventory. Contextual or content inventory is generated when listings are displayed on pages of Web sites (usually not search engines), where the written content on the page indicates to the ad-server that the page is a good match to specific keywords and phrases. Often this matching method is validated by measuring the number of times a viewer clicks on the displayed ad. These ads typically do not perform as well as traditional text ads on search engines, but the lower cost justifies the expense.
Contextual Search Campaigns	A paid placement search campaign that takes a search ad listing beyond search engine results pages and onto the sites of matched content web partners.
control panel	The webpages that let you manage your search engines and define their settings. To learn more, see the Getting Started page.
conversion (campaign conversion)	Conversion rate of the campaign-generated actions. Calculated by dividing the number of actions by the number of campaign clicks. If your actions are set to unique, then this metric is calculated by dividing the number of unique actions by the number
Conversion Action	The desired action you want a visitor to take on your site. Includes purchase, subscription to the company newsletter, request for follow-up or more information (lead generation), download of a company free offer (research results, a video or a tool), subscription to company updates and news.
Conversion Funnel	The series of pages that move a visitor towards a pre-defined action.
Conversion Navigator	SiteCatalyst page that displays the Conversion reports that are available to your given implementation.
Conversion Rate Improvement	An improved conversion rate means increased income without increased expenditure. I can use my knowledge of web analytics to understand how people interact with your site. I can then recommend changes to the site to improve its conversion rate. I have proven experience in this field (read my case study). I typically improve sales 5 or 10-fold in the first 3 months. I'm so confident of my ability that I'm happy to base my fees on the improvements I make. Contact me to discuss your requirements.
Conversion Rate	Determines the effectiveness of your web page in converting visitors to sales or leads. Percentage of clicks that converted to actions. (# actions / # clicks)
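The ratio in the Conversion Rate entry can be written out as a small Python function. This is only a sketch: the function name and the choice to return 0.0 when there are no clicks are assumptions made for this example.

```python
def conversion_rate(actions, clicks):
    """Percentage of clicks that converted to actions (# actions / # clicks)."""
    if clicks == 0:
        # Assumed convention: no clicks means nothing could convert.
        return 0.0
    return 100.0 * actions / clicks


# 25 actions from 1,000 clicks is a 2.5% conversion rate.
print(conversion_rate(25, 1000))
```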
Conversion Ratio	See Conversion Rate
Conversion Variable	The Custom Insight Conversion Variable (or eVar) is placed in the Omniture code on selected web pages of your site. Its primary purpose is to segment conversion success metrics in SiteCatalyst custom reports. Evar variables can be visit-based and can function similarly to cookies on the site. Values passed into eVar variables follow the user for a predetermined period of time, based on configurations made in the Settings Tab.
Conversion	A conversion is said to occur when a visitor completes an activity that you have identified as important. This activity could be a purchase, an email list registration, a download, or viewing an online presentation. When you sign up for Google Analytics, you have the opportunity to specify your goal pages: pages that a visitor can only reach by completing a conversion activity. If you use Urchin Software, you set your goal pages within a profile.
Conversions and Averages	SiteCatalyst report that shows revenue based on specified events, and shows drop-out average from event to event.
Convert	See conversion.
cookie (first party)	A cookie that is created by the website you are currently visiting.
cookie (http cookie)	A small text file sent to your browser by a web server. It's used for authentication, session tracking, and maintaining user-specific information, e.g., site preferences, shopping cart contents, etc.
Cookie Combining Utility Plug-in	This cookieCombiningUtility will reduce the number of cookies set by Omniture’s code. Data for all cookies will be stored within one session cookie and one persistent cookie.
cookie support	Whether cookies are enabled on your visitors' browsers or not.
Cookie	A small amount of text data given to a web browser by a web server. The data is stored and returned to the specific web server each time the browser requests a page from that server. The main purpose of cookies is to pass a unique identifier to the
cookieDomainPeriods Variable	The cookieDomainPeriods variable is used to determine the domain with which cookies will be set. The name cookieDomainPeriods refers to the number of periods in the domain when the domain begins with www. For example, www.mysite.com contains two periods, while www.mysite.co.jp contains three periods. Another way to describe the variable is the number of sections in the main domain of the site (two for mysite.com and three for mysite.co.jp).
cookieLifetime Variable	The cookieLifetime variable is used by both JavaScript and SiteCatalyst servers in determining the lifespan of a cookie.
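To make the "number of sections" reading concrete, the following Python sketch keeps the last N dot-separated sections of a hostname. It illustrates the idea only and is not Omniture's implementation; the function name `cookie_domain` and its default value are assumptions.

```python
def cookie_domain(hostname, cookie_domain_periods=2):
    """Keep the last `cookie_domain_periods` sections of the hostname."""
    labels = hostname.split(".")
    return ".".join(labels[-cookie_domain_periods:])


print(cookie_domain("www.mysite.com", 2))    # two sections in the main domain
print(cookie_domain("www.mysite.co.jp", 3))  # three sections in the main domain
```

With the wrong count, www.mysite.co.jp would reduce to co.jp, which is not a domain the site owns. That is why the setting matters for country-code domains.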
Core Site	A single site that represents key business requirements.
Correlation	Correlation Reports are particularly useful for understanding the relationships between two or more Traffic Custom Insight variables, or other system variables. Correlations come in three sizes (2, 5, and 20), based on the number of items that are correlated together at the same time. Correlation groups with more than 500,000 unique combinations of values cannot be correlated within the real time interface.
cost (campaign cost)	Amount spent on the advertising campaign during the selected period. The campaign cost is taken from your campaign settings or is provided by search engines.
cost consolidation	Cost consolidation refers to a capability, which you can enable or disable for your campaigns, that automatically consolidates the paid search results obtained by our tracking system with click and cost data provided by the paid search engines. This way, your campaign data should more closely reflect what your paid search vendors report. Your current day's data is the data collected by our tracking system (since most search engines are unable to provide same-day reporting). Once every 24 hours, we consolidate this data with the cost and click data reported by the paid search engines. (Please note the search engines may occasionally need up to 18 hours to provide that data.) All this happens automatically, but you can choose which data to display. If you enable cost consolidation, metrics such as CPA (cost per action), ROAS (return on ad spend), and conversion rates will be calculated based on consolidated data. When this feature is not enabled, you're looking at the data our tracking system collected.
Cost-per-click (CPC)	An advertising model in which the advertiser (sponsor) pays the publisher a certain amount each time the sponsor's ad is clicked. Also sometimes referred to as PPC (pay-per-click).
Cost-per-Thousand (CPM)	System where an advertiser pays an agreed amount for the number of times their ad is seen by a consumer, regardless of the consumer's subsequent action. This term is heavily used in print, broadcasting and direct marketing, as well as with online banner ad sales. CPM stands for cost per thousand, since ad views are often sold in blocks of 1,000. The M in CPM is the Roman numeral for 1,000, from the Latin mille.
Counterfeit Card	A fake credit or debit card which has been printed, embossed or encoded so as to appear legitimate, or a valid card that has been altered or re-encoded with fake data.
Courses	Coming Soon
CPA (Cost per action)	A measure of how much it cost in ad spend to acquire a new lead or sale. Also known as cost per acquisition; cost per lead; cost per order. (Click charges / # of actions)
CPA or “Cost Per Acquisition”	Also referred to as Cost Per Action. This is a metric used to measure the total monetary cost of each sale, lead or action from start to finish.
CPA	Cost per action. The cost of your advertising campaigns divided by the number of actions. This metric shows the effectiveness of your marketing campaigns. The campaign cost is taken from your campaign settings or is provided by search engines.
CPC (Cost per click)	A measure of how much you paid on average for each click. (Click charges / # of clicks)
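The two ratios given for CPA (click charges / # of actions) and CPC (click charges / # of clicks) can be sketched in Python as follows. The function names, the sample figures, and the choice to return None for a zero denominator are assumptions made for this example.

```python
def cost_per_click(click_charges, clicks):
    """Average amount paid per click (click charges / # of clicks)."""
    return click_charges / clicks if clicks else None


def cost_per_action(click_charges, actions):
    """Ad spend needed to acquire one lead or sale (click charges / # of actions)."""
    return click_charges / actions if actions else None


# Hypothetical campaign: $50.00 in click charges, 200 clicks, 4 sales.
print(cost_per_click(50.0, 200))
print(cost_per_action(50.0, 4))
```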
CPC or “Cost Per Click”	Some search engines charge advertisers a cost for every click sent to their web site. The CPC is the total cost for each click received.
CPC	Cost per Click. An advertising model in which the advertiser (sponsor) pays the publisher a certain amount each time the sponsor’s ad is clicked. Also sometimes referred to as PPC (pay-per-click).
CPG	Consumer Packaged Goods
CPL	Cost per Lead. The cost for gaining a lead to a new customer.
CPM or “Cost Per Thousand”	A unit of measure typically assigned to the cost of displaying an ad. If an ad appears on a web page 1,000 times and costs $5, then the CPM would be $5. In this instance, every 1,000 times an ad appeared, it would incur a charge of $5.
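The worked example in this entry (1,000 ad views costing $5 gives a CPM of $5) can be checked with a short Python sketch. The function names and the 250,000-impression campaign are assumptions used only for illustration.

```python
def cpm(total_cost, impressions):
    """Cost per thousand impressions."""
    return total_cost * 1000 / impressions


def campaign_cost(cpm_rate, impressions):
    """Total charge for a number of impressions at a given CPM rate."""
    return impressions * cpm_rate / 1000


print(cpm(5.0, 1000))                # the entry's example: $5 CPM
print(campaign_cost(5.0, 250_000))   # hypothetical 250,000-impression buy
```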
CPM	Cost per million. Pertains to instances in which the code on the client's web page generates a server call to Omniture, for example an image request.
CPO	Acronym for Cost Per Order. The dollar amount of advertising or marketing necessary to acquire an order. Calculated by dividing marketing expenses by the number of orders. Also referred to as CPA (Cost Per Acquisition).
Crawler	An automated program used primarily by search engines and other services to gather information from the World Wide Web.
Crawler/Spider/Robot	Component of a search engine that indexes web sites automatically. A search engine's crawler (also called a spider or robot) copies web page source code into its index database and follows links to other web pages.
Creative Element	Creative elements classifications are characteristics that vary between placements or instances of the campaign, and include characteristics such as media type, headline, keyword, and media vendor.
Creative	For the purposes of web analytics, “creative” describes the characteristics of a marketing activity, such as color, size and messaging; for example, a “Buy Now” graphic.
Creatives	Refers to the team in your organization that works to develop certain collateral, for example, the web site or any other marketing material.
CRM	Customer Relationship Management. CRM entails all aspects of interaction a company has with its customer, whether sales or service related. Computerisation and the proliferation of self-service channels like the web are leading to more of these
Cron Job	A cron job is a scheduled task under a UNIX-type operating system. cron is a daemon, or program that is always running. Its function is similar to the Windows Scheduler.
cross sell analysis	A cross sell analysis report is available, which shows products sold together within a shopping cart (or within different shopping carts during the same visit).
Cross-Border Card Fraud	Fraud perpetrated using a payment card, or card number, in a country other than the country of issue.
Cross-sell	SiteCatalyst report that shows the relationship between products in the same product string. For example, if a visitor purchased Item A, what other products were also in the cart at the time of purchase.
CSE	Initialism for custom search engine.
CSV	Comma-separated values. File format where columns of data are separated by commas.
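As a small illustration of the format, the Python sketch below parses comma-separated text with the standard `csv` module. The sample data is made up for this example. It includes a quoted field to show how a value that itself contains a comma stays in one column.

```python
import csv
import io

# Made-up sample data: a header row and two records.
raw = 'term,acronym\nCost per click,CPC\n"Crawler, spider or robot",n/a\n'

rows = list(csv.reader(io.StringIO(raw)))
for row in rows:
    print(row)
```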
CTR (Click through rate)	Used to convey the effectiveness of your online ad. (# clicks / # impressions)
CTR	Click through rate. Average number of clicks generated by the views of your online ad (impressions). Calculated by dividing the total number of clicks by the number of impressions, expressed as a percentage. This metric shows the effectiveness of your online ads.
currency (project default)	For each project, a default currency must be selected. That currency is used in generating all reports for that project. See Available Currencies for a list of valid currencies.
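The click-through calculation (# clicks / # impressions, expressed as a percentage) can be sketched in Python as below. The function name and the handling of zero impressions are assumptions made for this example.

```python
def click_through_rate(clicks, impressions):
    """Clicks divided by impressions, expressed as a percentage."""
    if impressions == 0:
        # Assumed convention: an ad that was never shown has a CTR of 0.
        return 0.0
    return 100.0 * clicks / impressions


# 30 clicks on 1,500 impressions is a 2% click-through rate.
print(click_through_rate(30, 1500))
```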
Currency Separators	The currency separator is used to separate dollars and cents in a monetary amount, for example, in U.S. dollars, the separator is the decimal (.); i.e. $203.78.
currencyCode Variable	The currencyCode variable is used to determine the conversion rate to be applied to revenue as it enters the SiteCatalyst databases.
Custom Conversion Insight Variable (eVar)	The eVar variable is placed in the Omniture code on selected web pages of your site. Its primary purpose is to segment conversion success metrics in SiteCatalyst custom reports. eVar variables can be visit-based and can function similarly to cookies on the site. Values passed into eVar variables follow the user for a predetermined period of time, based on configurations made in the Settings Tab.
Custom eVars	Open variables that can be used for custom purposes on your particular site. eVars are conversion-related variables only. See prop.
custom field	A variable that may be populated with user-defined data, such as age, sex, hotel name, error message etc.
Custom GA Codes	In some cases, the data that has to be gathered is dynamic and changes depending on the situation. Our web analysts can work with your web development team to make sure that your reporting requirements are met despite the constraints.
Custom Insight Report	The set of reports that are related to the group of custom traffic variables (prop variables). Each prop report is a separate custom insight report.
Custom Link	Custom links are the links on your site that are configured to send data to SiteCatalyst. The linkType variable (or the second parameter in the tl() function) is used to identify the report in which the link name or URL will appear (Custom, Download or Exit Links Report).
Custom Pattern	Pattern in the Pattern Builder that enables you to build a custom pattern.
custom report wizard	A Yahoo! Web Analytics wizard that helps you create your own reports by dragging and dropping group items and metrics.
Custom Search element	A Web Element for custom search engines. You can embed a custom search engine into your website by copying and pasting code generated in the Custom Search element site.
custom search engine	The search engine that you create and customize using Google Custom Search. The term is sometimes shortened as CSE.
Custom Search	A Google product that lets you create a search engine for your website, your blog, or a collection of websites.
Custom software solutions built by Outsource2india	The web analytics team at Outsource2india have helped several customers meet their specific web analytics requirements, by building custom website statistics software. The following are some of the custom web analytics software solutions that we have built at O2I:
Custom Traffic Insight Variable (prop)	The s.prop variable enables you to correlate custom data with specific traffic-related events. s.prop variables are embedded in the SiteCatalyst code on each page of your website. Through s.prop variables, SiteCatalyst allows you to create custom reports, unique to your organization, industry, and business objectives.
Custom web analytics software solutions – Why opt for it?	Your organization might already be using one or two web analytics tools to help you with your web analytics requirements. However, you might be facing scenarios which your tools are not able to address. We can help you in such a situation, by providing you with a custom web analytics software solution or a custom website statistics software. With our expertise in both software and web analytics and a profound understanding in web analytics, we can build a custom website stats software that would perfectly meet your requirements.
Customer lifecycle analytics	Customer lifecycle analytics is a visitor-centric approach to measuring that falls under the umbrella of lifecycle marketing. Page views, clicks and other events (such as API calls, access to third-party services, etc.) are all tied to an individual visitor instead of being stored as separate data points. Customer lifecycle analytics attempts to connect all the data points into a marketing funnel that can offer insights into visitor behavior and website optimization.
Customer Loyalty	The Customer Loyalty Report reveals purchasing patterns of customers based on three categories of loyalty. A user who enters the site and makes a single purchase is considered a “new customer.” A user who makes their second and third purchases is considered a “return customer,” and a user who makes four or more purchases is considered a “loyal customer.”
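The three loyalty categories described here map directly onto purchase counts, as the following Python sketch shows. It only illustrates the thresholds in the entry and is not how SiteCatalyst computes the report; the function name and the None result for users with no purchases are assumptions.

```python
def loyalty_category(purchase_count):
    """Classify a user by number of purchases, using the entry's thresholds."""
    if purchase_count <= 0:
        return None  # assumed: users without a purchase are not customers yet
    if purchase_count == 1:
        return "new customer"
    if purchase_count <= 3:
        return "return customer"
    return "loyal customer"


for count in (1, 2, 3, 4):
    print(count, loyalty_category(count))
```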
Customer Segment	A sub-group of customers who share a particular attribute.
Customer Support Portal	Refers to the Help Section of SiteCatalyst where users can go to receive answers to their questions. Options include white papers, Knowledge Base, Live Chat, Ask a Question, SiteCatalyst User Guide, Implementation Manual, et cetera.
Customer	A Web site visitor who has completed a transaction on your site.
Customized Calendar	Calendar options in SiteCatalyst other than the Gregorian model. Options include the 4-4-5 and 4-5-4 calendar models, both of which are used as standards for the retail industry. Additionally, SiteCatalyst offers an option for a completely customizable calendar that you can set up yourself.
CVC or CVV	Card Verification Code (for MasterCard) or Card Verification Value (for Visa). Encrypted numeric value contained in the data on the magnetic stripe which can be checked to ensure that the information has not been altered in any way.
CVM	Cardholder Verification Method. The means by which the presenter of the card may be identified as genuine, for example by a signature or PIN.
Cyberspace	Coined by science fiction writer William Gibson, this is the name for the virtual space you are navigating when you are negotiating the web.
Daemon	A daemon is any program under a UNIX-type operating system that runs at all times. Common daemons are servers (such as Apache or an FTP server) and schedulers (such as cron).
Daily Return Visits	Report that displays the number of visitors who came to your web site more than once on a given day. A day is defined as the last 24-hour period.
Daily Unique Visitor	The number of unduplicated (counted only once) visitors to your website over the course of a single day. The visit for the daily unique visitor ends at midnight for the time zone selected in the report suite. See Unique Visitors.
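Counting each visitor only once per day can be sketched in Python with a set per calendar day. The hit format below (a date string already in the report time zone, plus a visitor ID) and all sample values are assumptions made for this example.

```python
from collections import defaultdict


def daily_unique_visitors(hits):
    """Count distinct visitor IDs per day from (day, visitor_id) pairs."""
    seen = defaultdict(set)
    for day, visitor_id in hits:
        seen[day].add(visitor_id)  # a set ignores repeat visits on the same day
    return {day: len(ids) for day, ids in seen.items()}


# Hypothetical hits; visitor "a" returns twice on the first day.
hits = [
    ("2013-07-25", "a"),
    ("2013-07-25", "b"),
    ("2013-07-25", "a"),
    ("2013-07-26", "a"),
]
print(daily_unique_visitors(hits))
```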
Dallas Data Center	Omniture's center in Dallas, Texas, where web analysis data is collected and stored.
DARTmail Integration	Integrated process between DoubleClick's DARTmail and Omniture SiteCatalyst. The process involves capturing e-mail addresses in SiteCatalyst and subsequently delivering e-mail lists via a Data Warehouse report to DARTmail.
DARTmail	DoubleClick's DARTmail application provides the ability to perform e-mail remarketing through the capture of user e-mail addresses.
Dashboard Accelerator	The Dashboard Accelerator stores a cached version of your dashboard for subsequent viewing. By caching (saving) a view of your dashboard for 24 hours, the Dashboard Accelerator is able to retrieve that view almost instantaneously.
Dashboard Player	Omniture's Dashboard Player is a Flash-based tool that displays the reportlets from selected dashboards in slideshow fashion on your screen.
Dashboard	A single screen that provides multiple metrics, reports or charts on the effectiveness of your business.
Data Block	Data blocks are used in Excel Integration. They contain data that is taken from SiteCatalyst and posted to an Excel worksheet.
Data Extract	Data extracts let you choose the parameters you will view on both the X- and Y-axes of the report, as well as the item by which the report will be filtered. For example, you could place products along the left side of the report, dates across the top, and a metric as the overall data filter. The Data Extracts are delivered reports, and are only available in CSV format.
Data Feed	A data feed is a compressed (zipped) file containing traffic data for one report suite. If data feeds are enabled for multiple report suites, a separate data file is sent for each report suite. Each compressed data feed file contains 13 tab-delimited text files, each carrying a .tsv filename extension. One file contains traffic data and the other files contain reference data.
Data Mining	The process of collecting and analysing data to provide information sorted in flexible ways.
Data Sources Activation Wizard	Part of the Data Sources Manager that enables you to activate or deactivate existing Data Sources. See Data Sources Manager.
Data Sources Manager	SiteCatalyst interface option that enables you to see and configure any data sources that are currently active and configure any new data sources.
Data Warehouse	A logical collection of information gathered from many different operational databases and used to create business intelligence that supports business analysis activities and decision-making tasks. It is primarily a record of an enterprise's past transactional and operational information, stored in a database designed to favor efficient data analysis and reporting.
Database Driven	Sawmill stores your statistics in an optimized database. This can be Sawmill's own built-in high-performance database, or it can be a Microsoft SQL Server, Oracle, or MySQL database. This database can be incrementally updated as new log files arrive, and old data can be periodically expired from the database. Sawmill generates reports directly from the database; it does this so quickly (just a few seconds) that you can browse through your reports in real time, asking for new views of your data with every click, and receiving the information you want almost instantly.
Database VISTA Rule	VISTA rule that is used in conjunction with customer data on an Omniture database. Used as a time and a space saver when working with user data. The user loads a table of attributes, which are then used by a VISTA rule to populate data. The
Date Range	Google Analytics Date Range feature allows you to view report data by an arbitrary time frame, from one day up to more than a year. Most reports have the Date Range feature available.
date	In the context of the reports, this refers to the date when a visitor accessed your website.
Day-parting	Segmentation of the time of day based on selected events.
Dayparting	The ability to specify different times of day or day of week for ad displays, as a way to target searchers more specifically. An option that limits serves of specified ads based on day and time factors.
Days Before First Purchase	SiteCatalyst report that shows the number of days that pass between the first time customers visit your site and when they finally make a purchase.
Days Before Last Purchase	SiteCatalyst report that shows the most common number of days that pass between customers' repeat purchases and allows you to view the time periods that contributed most to your site's key success metrics, such as revenue and orders.
Deep Linking	Linking that guides, directs and links a click-through searcher (or a search engine crawler) to a very specific and relevant product or category web page from search terms and PPC ads.
Default Page	The default page setting should be set to whatever the default (or index) page is in your site's directories. Usually, this will be index.html, but on Windows IIS servers, it is often Default.htm or index.htm. This information allows Google Analytics to reconcile log entries such as http://www.example.com/ and http://www.example.com/index.html, which are in fact the same page. Without the Default Page information entered correctly, these would be reported as two distinct pages. Only a single default page should be specified.
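The reconciliation described in this entry can be pictured with a small Python sketch that appends the default page to any URL ending in a directory. This is only an illustration of the idea, not Google Analytics' actual logic; the function name `canonical_page` and its default argument are assumptions.

```python
from urllib.parse import urlparse


def canonical_page(url, default_page="index.html"):
    """Map a directory URL and its default page to one reporting key."""
    parsed = urlparse(url)
    path = parsed.path or "/"
    if path.endswith("/"):
        path += default_page  # a directory request serves the default page
    return parsed.netloc + path


print(canonical_page("http://www.example.com/"))
print(canonical_page("http://www.example.com/index.html"))
```

Both calls yield the same key, so the two log entries would be counted as one page.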
Demographics	The physical characteristics of human populations and segments of populations, often used to identify consumer markets. Demographics can include information such as age, gender, marital status, education, and geographic location. See also psychographics.
DHTML	Stands for Dynamic Hypertext Markup Language.
Dial-up	A low-speed, temporary connection to the internet using a modem over a standard, dialled telephone connection.
Digital Cash	A system that allows a person to pay for goods or services by transmitting a number from one computer to another. Like the serial numbers on real banknotes, digital cash numbers are unique. Each one is issued by a bank and represents a specific sum of money. Like real cash, it is anonymous and reusable: when digital cash is sent from a buyer to a vendor, no personal information is exchanged.
Digital Certificate	An attachment to an electronic message used for security purposes. The most common use of a digital certificate is to verify that a user sending a message is who he or she claims to be, and to provide the addressee with the means to encode a reply. An individual wishing to send an encrypted message applies for a digital certificate from a Certificate Authority (CA). The CA issues an encrypted digital certificate containing the applicant's public key and a variety of other identification information. The CA makes its own public key readily available through print publicity or perhaps on the internet. The recipient of an encrypted message uses the CA's public key to decode the digital certificate attached to the message, verifies it as issued by the CA and then obtains the sender's public key and identification information held within the certificate. With this information, the recipient can send an encrypted reply. The most widely used standard for digital certificates is X.509.
Digital Receipt	An electronic (email / PDF / printable webpage) acknowledgement of an order placed from a commerce-enabled website.
Digital Signal	All digital data is represented by one of two binary states, off and on, corresponding to 0 or 1. Strings of 1s and 0s can be organised to represent text, images, audio and video information in efficient, non-degradable formats. Devices such as modems and codecs can translate digital to analogue signals and vice versa. See also Analogue Signal.
Digital Signature	An electronic signature that can be used to authenticate the identity of the sender of an email or of the signer of an electronic document. It can also be used to ensure that the original content of the message or document has been conveyed
Digital Wallet	Encryption software that works like a physical wallet during electronic commerce transactions. A wallet can hold a user's payment information, a digital certificate to identify the user, and shipping information to speed transactions. The consumer benefits because his or her information is encrypted against piracy and because some wallets will automatically input information at the merchant's site. Most wallets reside on the user's PC.
Dimensions	A general source of data that can be used to define various types of segments or counts and represents a fundamental dimension of visitor behavior or site dynamics. Examples include event and referrer.
Direct	These visitors came to your site by manually entering the URL of the page.
Directory Search	Also known as search directory. Refers to a directory of web sites contained in an engine that are categorized into topics. The main difference between a search directory and a search engine is in how the listings are obtained. A search directory relies on user input in order to categorize and include a web site. Additionally, a directory usually only includes higher-level pages of a domain.
Directory	A directory is a virtual container for holding computer files. It is not merely a list of items, as the name would imply, but rather a key building block of a computer's storage architecture that actually contains files or other directories.
Disaster Backup And Recovery	Duplicated software and hardware that provide a backup copy of a computer system. Recovery procedures are predefined and used to re-establish the original environment in the event of a disaster.
Discover	Discover is an advanced query and data exploration tool.
Display URL	The web page URL that one actually sees in a PPC text ad. Display URL usually appears as the last line in the ad; it may be a simplified or shortened path for the longer actual URL, which is not visible.
Distribution Network	A network of web sites (content publishers, ISPs) or search engines and their partner sites on which paid ads can be distributed. The network receives advertisements from the host search engine, paid for with a CPC or CPM model. For example, Google's advertising network includes not only the Google search site, but also searchers at AOL, Netscape and the New York Post online edition, among others.
DKI	Acronym for Dynamic Keyword Insertion, the insertion of the EXACT keywords a searcher included in his or her search request in the returned ad title or description. As an advertiser, you have bid on a table or cluster of these keyword variations, and DKI makes your ad listings more relevant to each searcher.
DMA	The Designated Market Area (DMA) Report segregates the United States into marketing areas. Internet Service Providers (ISPs) in each market area supply the American Registry for Internet Numbers (ARIN) with the IP addresses they use. Omniture partners with Digital Envoy to receive geosegmentation data that matches the IP address of a website visitor with the geographic city, state, zip code, and DMA for that IP address.
DMCA	Acronym for Digital Millennium Copyright Act. The Digital Millennium Copyright Act (DMCA) is a United States copyright law which criminalizes production and dissemination of technology, devices, or services that are used to circumvent measures that control access to copyrighted works (commonly known as DRM), and criminalizes the act of circumventing an access control, even when there is no infringement of copyright itself. [Circumvention of controlled access includes unscrambling, copying, sharing, commercial recording or reverse engineering copyrighted entertainment or software.] It also heightens the penalties for copyright infringement on the Internet. Source: Wikipedia
DNS Entry	A web server configuration setting that maps IP addresses to a domain name. A CNAME is an example of a DNS Entry.
DNS Lookup – (Reverse DNS Lookup)	The process of converting a numeric IP address into a text name, for example, 63.212.171.4 is converted to www.googleanalytics.com.
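As a rough illustration of the reverse lookup described above, the following Python sketch builds the name that a reverse DNS (PTR) query would ask for, using the example address from the entry. It relies only on the standard `ipaddress` module and does not contact a DNS server; the host name actually returned depends on live DNS data, so that step is deliberately left out.

```python
import ipaddress


def ptr_name(ip: str) -> str:
    """Return the reverse-DNS (PTR) query name for an IP address."""
    return ipaddress.ip_address(ip).reverse_pointer


# The octets are reversed and placed under the in-addr.arpa zone.
print(ptr_name("63.212.171.4"))  # 4.171.212.63.in-addr.arpa
```

A resolver then looks up the PTR record stored under that name to obtain the text host name.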
DNS	A Domain Name System is an Internet addressing scheme that uses a group of names that are listed with dots (.) between them. See also domain.
document group	You can arbitrarily mark different pages within your site by tagging them with the setDocumentGroup method and defining a specific value for each group. You can then generate reports using the document groups you have created for your website.
document name	The title of a web page (used for reporting purposes).
Domain Name System – (DNS)	An Internet addressing system that uses a group of names that are listed with dots (.) between them, working from the most specific to the most general group. In the United States, the top (most general) domains are network categories such as edu (education), com (commercial), and gov (government). In other countries, a two-letter abbreviation for the country is used, such as ca (Canada) and au (Australia).
Domain Name	The text name that corresponds to a numeric IP address of a computer on the Internet.
domain	Domains you are tracking.
Domain	Refers to a specific web site address.
Doorway/Landing/Gateway/Bridge/Jump Pages	A web page created expressly in hopes of ranking well for a term in a search engine's organic/non-paid listings and which itself does not deliver much information to those viewing it. Instead, visitors will often see only some enticement on the doorway page leading them to other pages, or they may be seamlessly redirected to a real page within the existing web site. With cloaking, visitors may never see the doorway page at all. Several search engines have guidelines against doorway pages, though they are more commonly allowed in through paid inclusion programs.
doPlugins Function	JavaScript plug-ins are usually called by the doPlugins function, which is executed when the t() function is called in the Code to Paste.
Download Link	Computer files that can be accessed (downloaded) via a link on your web site.
Download	To retrieve a file or files from a remote machine to your local machine.
downloaded file name	The file names of the downloadable files on your website.
downloaded file URL	The URLs of the downloadable files on your website.
Downtime	Period during which a computer system or network is not functioning due to a power, hardware or software failure.
Dr. Teeth	CRM system used by Omniture to track and manage client settings and other configurations, as set in SiteCatalyst.
DRAM	Dynamic random-access memory chips, typically used for temporary storage in a PC. RAMs lose their memory information when the power is switched off.
Drilldown	Term used to define the act of moving from the general to the specific; enables the examination of the data underlying any summarized form of information.
DSL	Digital Subscriber Line. A DSL circuit uses standard telephone subscriber cabling but, in addition to voice information, transmits digital pulses at inaudible high frequencies that are capable of providing broadband-speed internet access to the home. DSL systems are the basis of today's broadband revolution. See also ADSL.
DTI	Department of Trade and Industry. UK government body that regulates national business operations, including e-commerce.
Dynamic Account	The SiteCatalyst .JS file may be configured to automatically select a report suite ID. The .JS file will automatically send the image request to the report suite based on the URL. For example, if the URL is www.mysite.com, the image request is automatically sent to report suite A; if the URL is www.mysite1.com, the image request is automatically sent to report suite B.
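The URL-based selection described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the SiteCatalyst .JS implementation: the mapping table, the report suite IDs and the function name `select_report_suite` are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical mapping from host name to report suite ID.
SUITE_BY_HOST = {
    "www.mysite.com": "report_suite_a",
    "www.mysite1.com": "report_suite_b",
}


def select_report_suite(url: str, default: str = "report_suite_default") -> str:
    """Pick a report suite ID from the host part of the page URL."""
    host = urlparse(url).hostname or ""
    return SUITE_BY_HOST.get(host, default)


print(select_report_suite("http://www.mysite.com/index.html"))
print(select_report_suite("http://www.mysite1.com/cart"))
```

Hosts that are not in the table fall back to the default suite, so an unexpected URL still sends its data somewhere predictable.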
Dynamic Landing Pages	Dynamic landing pages are web pages to which click-through searchers are sent that generate changeable (not static) pages with content specifically relevant to the keyword search. For example, if a user is looking for trucks, then a dynamic landing page with information and pictures on multiple models and, possibly, geographically localized dealerships might be served. The term truck would trigger a data dump into a web site template for all possible vehicles, that serves all truck-related information.
Dynamic Text (Insertion)	This is text, a keyword or ad copy that customizes search ads returned to a searcher by using parameters to insert the desired text somewhere in the title or ad. When the search query (for example, “hybrid cars”) matches the defined parameter (for
dynamicAccountList Variable	The SiteCatalyst JavaScript file can be used to dynamically select a report suite to which it will send data. The dynamicAccountList variable contains the rules that will be used to determine the destination report suite.
dynamicAccountMatch Variable	The dynamicAccountMatch variable uses the DOM object to retrieve the section of the URL that all rules in dynamicAccountList are applied to. This variable is only valid when dynamicAccountSelection is set to True.
DynamicObjectIDs Plug-in	To increase accuracy of ClickMap, you may assign a unique ID to every link on your site. This plug-in provides an automated way of assigning and customizing IDs. This is especially useful when your link URLs exceed 255 characters.
E-Book	A book that's been condensed into a special text format and available to download, usually at low cost. A special reader may be required to view e-books.
E-commerce	The buying and selling of goods and services, and the transfer of funds, through digital communications. Buying and selling over the internet, etc.
e-Retailer	Individual or business selling goods or providing paid-for services online.
E-zine / eZine	An electronically-published magazine, available on the web.
Easy To Install	Sawmill is extremely easy to install. For Windows or MacOS, just run the installer and launch the program. For UNIX, just tar/gunzip it and run the executable. Sawmill starts its built-in web server, and you’re ready to start using it immediately. Or
Easy To Use	Sawmill presents an intuitive web-based user interface, which leads you through every step of browsing your log file statistics. The New Profile Wizard asks questions when it needs information, so you only have to deal with the configuration options which are relevant to the task at hand. Statistics pages are full of links to related information, organized intuitively so you can get to the information you want with a minimal number of clicks. All aspects of Sawmill's profiles are customizable through the web interface, so you can create new profiles, custom reports, custom filters, and much more, directly from Sawmill's web interface.
EBPP	Electronic Bill Presentment and Payment. EBPP is the process by which companies bill customers and receive payments electronically over the internet.
eBusiness / E-Business / Ebusiness	See eCommerce.
ECN	Electronic Communications Network. Electronic trading systems that automatically match buyers’ and sellers’ orders.
Ecommerce	Conducting commercial transactions on the internet where goods, information or services are bought and sold.
Economic factors	Logfile analysis is almost always performed in-house. Page tagging can be performed in-house, but it is more often provided as a third-party service. The economic difference between these two models can also be a consideration for a company deciding which to purchase. Logfile analysis typically involves a one-off software purchase; however, some vendors are introducing maximum annual page views with additional costs to process additional information. In addition to commercial offerings, several open-source logfile analysis tools are available free of charge. For logfile analysis you have to store and archive your own data, which often grows very large quickly. Although the cost of hardware to do this is minimal, the overhead for an IT department can be considerable. For logfile analysis you need to maintain the software, including updates and security patches. Complex page tagging vendors charge a monthly fee based on volume, i.e. the number of pageviews per month collected. Which solution is cheaper to implement depends on the amount of technical expertise within the company, the vendor chosen, the amount of activity seen on the web sites, the depth and type of information sought, and the number of distinct web sites needing statistics. Regardless of the vendor solution or data collection method employed, the cost of web visitor analysis and interpretation should also be included. That is, the cost of turning raw data into actionable information. This can be from the use of third party consultants, the hiring of an experienced web analyst, or the training of a suitable in-house person. A cost-benefit analysis can then be performed. For example, what revenue increase or cost savings can be gained by analysing the web visitor data?
eCPM	Acronym for Effective Cost Per Thousand, a hybrid Cost-Per-Click (CPC) auction calculated by multiplying the CPC times the click-through rate (CTR), and multiplying that by one thousand. (Represented by: (CPC x CTR) x 1000 = eCPM.) This monetization model is used by Google to rank site-targeted CPM ads (in the Google content network) against keyword-targeted CPC ads (Google AdWords PPC) in their hybrid auction.
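The formula in the eCPM entry, (CPC x CTR) x 1000, can be checked with a short Python sketch. The input figures below are made-up illustration values, and CTR is assumed to be expressed as a fraction (0.02 for 2%) rather than as a percentage.

```python
def ecpm(cpc: float, ctr: float) -> float:
    """Effective cost per thousand impressions: (CPC x CTR) x 1000.

    cpc: cost per click in currency units.
    ctr: click-through rate as a fraction, e.g. 0.02 for 2%.
    """
    return cpc * ctr * 1000


# Hypothetical ad: 0.50 per click with a 2% click-through rate.
print(round(ecpm(0.50, 0.02), 2))
```

Expressing a CPC ad this way puts it on the same per-thousand-impressions scale as a CPM ad, which is what makes the two comparable in a single auction.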
EDI	Electronic Data Interchange. Transfer of data between different companies using private networks or the internet. EDI is becoming increasingly important as an easy mechanism for companies to buy, sell, and trade information. ANSI has approved a set of EDI standards known as the X.12 standards.
Editorial Review Process	A review process for potential advertiser listings conducted by search engines, which check to ensure relevancy and compliance with the engine's editorial policy. This process could be automated using a spider to crawl ads or it could be human editorial ad review. Sometimes it's a combination of both. Not all PPC Search Engines review listings.
effective query quota	The maximum number of queries allowed for a transferred Google Site Search engine. It is calculated as the query quota specified in the Site Search plan minus any query usage at the time of the ownership transfer. To learn more, see Transfer Ownership.
Electronic Purse	A means of storing electronic funds locally, for use in ecommerce transactions.
eliminate search engine	A search engine that excludes certain sites; it searches across the web while excluding sites you had tagged with eliminate labels from the search results.
eliminate	A mode for a label. Sites tagged with eliminate labels are excluded from the results page of your search engine. To learn more, see Changing the Ranking of Your Search Results.
Email Bounce	The number of e-mails that were sent but never reached the intended receiver.
Email Campaign	A marketing effort that uses email to drive new visitors to a website.
Email Marketing	Outsource2india's email marketing services can help you run full-fledged email campaigns to reach your target audience. We can collate a database of email addresses, create powerful marketing messages, shoot emails and track the results.
Email	Electronic mail, a means to send messages and files across the internet from one computer to another.
Emetrics Summit	International conference for web analytics professionals since 2002, headed by Jim Sterne.
Emoticons	Otherwise known as smileys. These are shorthand ways of expressing emotion in email messages by using punctuation marks such as :-) for I am happy and :-( for I am sad.
Encryption	A method of using complex mathematical algorithms to scramble data so that only the intended recipient can decode it.
End User	The final user of the computer software. The end user is the individual who uses the product after it has been fully developed and marketed.
End-To-End Verification	A payment scheme in which credit card information is passed from the customer directly to the bank for verification, i.e. bypassing the retailer.
End-User License Agreement	Legal document contained within SiteCatalyst that outlines the terms and conditions for using SiteCatalyst. For more information on SiteCatalyst terms and conditions, refer to the Omniture Master Service Agreement.
engager
entries	Number of entries / (visits).
entry document group	The document groups from where the visits to your website started.
entry domain	Domains of your website where your visitors started their visits.
entry page title	Page title where the visits to your website started.
entry page URL	Page URL where the visits to your website started.
Entry Page	First viewed page on the site.
Entry Pages	The page on which a visitor enters your web site. See Landing Page.
Entry Site Sections	The site section on which a visitor enters your web site.
EPoS	Electronic Point of Sale. A smart till or other device for processing card payments.
ERP	Enterprise Resource Planning. ERP is a business management system that integrates all facets of the business, including planning, manufacturing, sales, and marketing.
Error Code	Please see the definition of Status Code.
Error	Errors are defined as pages that visitors attempted to view, but that returned an error message instead. Often these errors occur because of broken links (links to pages that do not exist anymore) or when an unauthorized visitor attempts to access restricted pages (for example, if the visitor does not have a password to access the page).
Ethernet	The most common protocol used to network computers in a LAN. Ethernet can transfer about 10,000,000 bits-per-second and is compatible with almost all computer systems. See also LAN.
ETSI	European Telecommunications Standards Institute, the governing body of telecoms standards within Europe.
EVA	Electronic Virtual Assistant, like MS Word's paperclip or XP's search dog: an avatar-like creature that helps a user to understand an application.
eVar	See Custom Conversion Insight Variable.
Event Serialization	Event serialization is the process of removing duplicate events on each page view of the site with SiteCatalyst tags by using a unique identifier appended to the event.
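The de-duplication idea behind event serialization can be sketched as follows. This is a simplified Python illustration, not SiteCatalyst's own processing: hits are modelled as hypothetical (event name, serial ID) pairs, and only the first hit for each pair is counted.

```python
def deduplicate_events(hits):
    """Keep the first occurrence of each (event name, serial ID) pair."""
    seen = set()
    unique = []
    for name, serial_id in hits:
        key = (name, serial_id)
        if key in seen:
            continue  # same event with the same ID: a duplicate, skip it
        seen.add(key)
        unique.append(key)
    return unique


# A page reload sends the same purchase event (ID "A1") twice.
hits = [("purchase", "A1"), ("purchase", "A1"), ("purchase", "B2")]
print(deduplicate_events(hits))
```

Without the identifier, the reloaded page would count as a second purchase; with it, the repeat is recognised and dropped.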
Event	An activity that can be tracked by SiteCatalyst.
events Variable	The events variable is used to record common shopping cart success events as well as custom success events.
Events	Coming soon
Excel Integration	SiteCatalyst tool that you can use to extract data that is displayed in the SiteCatalyst interface, and display the data in an Excel spreadsheet.
Excel Workbook Library	The Excel Workbook Library is a repository for any workbooks that you create with Excel, including workbooks that contain data blocks.
Exclude	Exclude is a filter type available in the Google Analytics Filters configuration. If an Exclude filter is applied to a Profile, all log file lines (hits) that match the Exclude string will be discarded prior to the creation of the corresponding Google Analytics reports.
Excluded Keywords	For details, please see the entry Negative Keywords.
exit link domain	Domains that your visitors went to after they left your website.
Exit Link	Any link that takes a visitor away from your site.
exit link	The URLs of the websites that your visitors went to after they left your website.
exit page URL	URL of the last page for visits to your website.
Exit Page	Last viewed page on the site.
Exit Pages	The page that contains the exit link. See Exit Link.
Exit Point	The page on which a visitor leaves your site.
Exit Site Sections	The site section on which a visitor leaves your site.
Expanded Mode	SiteCatalyst setting that displays all reporting submenus without having to click them to open them.
Expiration Trigger	Sets the lifetime of a variable value by letting you tell the system when to expire the variable's value. Until a value expires, it will be used as a modifier to the purchase events. Expiration triggers can be dates, time periods or conversion events.
Expiration	A pre-defined time period at which a success event terminates.
Expire	Termination of variable values. Until a value expires, it will be used as a modifier to purchase events.
Extensive Documentation	The manual for Sawmill is built right into the web interface program, so it's always at your fingertips as you use the program. Throughout the HTML interface, there are links to relevant sections of the online documentation, and wherever a configuration option is mentioned, or a value is requested for it, there is a link to that option's documentation page. You can browse the documentation by running any copy of Sawmill.
external annotations	A stand-alone annotation file. For contrast, see inline annotations. Most annotations are external annotations.
Extranet	An extension of an organisation's internal network to other locations or users external to the enterprise.
Eye Tracking Studies	Studies by Google, Marketing Sherpa and Poynter Institute using Eyetools technology to track the eye movements of web page readers, in order to understand reading and click-through patterns.
F.F.A	Stands for Free for All link pages. These are not search engines or directories. They are, for the most part, pages that simply take URL submissions that usually stay active for a period of time. A submission is placed at the top of their list and then moved down, and eventually out, as other submissions are made. These are seen as outdated and were used in an attempt to artificially inflate link popularity.
F.T.P	Stands for File Transfer Protocol.
facet item	Describes a refinement label and its title. The title is visible to users and appears in the search results page.
facet label	A refinement label that users can see in the search results page.
facet	A conceptual grouping of related refinement labels. A facet is composed of facet items.
Failover	Backend systems at Omniture that mitigate potential problems at several important levels without affecting front-end performance.
Fallout	Metric that displays the point at which a visitor abandons a site, action, path etc.
FAQ	Frequently Asked Questions, a listing of common enquiries and responses on a particular subject, typically as a section of a website.
Favorite	A means of creating an index of local and internet documents, files and websites so that they are easily accessible.
File Downloads	SiteCatalyst report that counts the number of times a download link is clicked on a page.
File Type	A File Type is a designation, usually in the form of an extension (such as .gif or .jpeg), given to a file to describe its function or the software that is required to act upon it. More generally, file types can be grouped into image file types (such as .gif, .png, .jpeg), text file types (such as .doc or .txt), and many others.
File	Any named collection of data stored in a physical location on a PC or the web. Can be text, graphic, sound, video etc.
Filter Field	A filter field is the number of the field on which to apply a filter. In a log file line, or hit, there are several distinct fields, each one holding a different piece of data. To apply a filter to a log file, you must first identify which field
Filter Name	The Filter Name is intended to be a descriptive title for a filter. It is used only as an organizational aid, and may contain spaces.
Filter Pattern	A Filter Pattern is the actual text string against which Google Analytics will attempt to match log file lines. If a match is found, the log line (or hit) will be either excluded or included, depending on the Filter Type. Patterns can be specific text to match or use wildcards as part of a regular expression. NOTE: Filter Patterns are case-sensitive, so to filter out the Googlebot spider, for instance, use Googlebot, not googlebot (do not use quotes).
filter search engine	A search engine that includes only sites tagged with filter labels and excludes all other sites.
Filter to Apply	The filter to apply is the actual text string to be used to either filter in or filter out content. The Filter to Apply can be either a plain text string or a regular expression.
Filter Type	A filter must be of one of two filter types, either an Include (filter in), or Exclude (filter out). If an inclusive filter (Include) is used, only hits containing the filter string will be represented in the Google Analytics report. If an exclusive filter (Exclude) is used, no hits containing the filter string will be represented in the Google Analytics report.
Filter	A filter is a text string or regular expression that is used to either exclude certain hits or only include certain hits from a Google Analytics report. Filters are commonly used to filter out certain content, such as internal company traffic or javascript libraries, or to set up special reports for only certain types of content, like a subsection of a web site.
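The Include and Exclude behaviour described in the filter entries above can be sketched with Python's `re` module. This is a simplified model rather than Google Analytics' own code: hits are represented as plain log-line strings, and the function name `apply_filter` and the sample lines are hypothetical.

```python
import re


def apply_filter(hits, pattern, filter_type):
    """Return the hits kept by an 'include' or 'exclude' filter.

    Matching is case-sensitive, as the Filter Pattern entry notes.
    """
    regex = re.compile(pattern)  # no re.IGNORECASE: case matters
    if filter_type == "include":
        return [hit for hit in hits if regex.search(hit)]
    if filter_type == "exclude":
        return [hit for hit in hits if not regex.search(hit)]
    raise ValueError("filter_type must be 'include' or 'exclude'")


hits = [
    "GET /index.html Googlebot/2.1",
    "GET /about.html Mozilla/5.0",
    "GET /index.html googlebot-lowercase",
]
# Only the first line is dropped: 'googlebot' does not match 'Googlebot'.
print(apply_filter(hits, "Googlebot", "exclude"))
```

The third sample line survives the Exclude filter precisely because of the case-sensitivity rule, which is why the pattern has to be written exactly as the spider identifies itself.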
Filtering	A means of narrowing the scope of a report or view by specifying ranges or types of data to include or exclude.
Firewall	A security device placed on a LAN (local area network) to protect it from Internet intruders. This can be a special kind of hardware router, a piece of software, or both.
First Party Cookie	These cookies are placed by the website itself, unlike third-party cookies, which are placed by vendors. First-party cookies are understood to be secure and reliable.
First Time Sessions	The number of times unique visitors came to your website during a specified time period, not having visited before that period. These visitors are identified by cookies.
First Time Unique Visitor	The number of Unique Visitors to your website that had not visited prior to the time frame being analyzed.
first time	First time visits. The first time a visitor comes to your website, a one-year, first-party, persistent cookie is set in the visitor's browser, which identifies them as a first time visitor.
First Visit / First Session	A visit from a visitor who has not made any previous visits.
First-party Cookies	First party cookies are left on your machine by the domain that you are currently viewing. See Cookies.
Flash Graph	An interactive graphical representation in SiteCatalyst that displays metrics on mouse-over. See Flash.
Flash subversion	The subversions of Flash that your visitors use.
Flash support	Whether or not a visitor's browser enables Flash.
Flash Tracking	Normal site tracking measures disregard the presence of Flash. We can tag your Flash pages so that important functionalities are tracked.
Flash version	The version of Flash that a visitor uses.
Flash	A graphics animation program, written and marketed by Macromedia, which uses vector graphics. The resulting files, sometimes called Flash files, may appear in a web page to view in a web browser, or standalone Flash players may play them. Flash files occur most commonly in animated advertisements on web pages and rich-media web sites.
Floating User License	Single license that multiple users can use, but only one user can access the application at a time.
Floor Limit	A limit agreed between the merchant and acquiring bank for each sale, above which authorisation must be obtained by the merchant.
Follow Pattern	Pattern in the Pattern Builder that enables you to analyze the page(s) following a selected page.
Forecast	Formula that lets you predict what your traffic is most likely to do next.
Form Abandonment	The ability to track form usage and abandonment on your site, or in other words, when visitors navigate away from your site after partially filling out a form.
Form Analysis Plug-in	The Form Analysis Plug-in tracks when visitors use forms on your site. This plug-in will track abandonment, successful submission, and errors when dealing with forms. Examples of forms elements include text boxes, radio buttons, drop-down boxes, et cetera.
Form	An HTML page which passes variables back to the server. These pages are used to gather information from users.
Forward	To send on or redirect an incoming message to another location or recipient.
Frame	A rectangular region within the browser window that displays a web page alongside other pages in other frames.
Freeware	A software application that can be downloaded, used and distributed free of charge. See also Shareware.
Frequency / Session per Unique	Frequency measures how often visitors come to a website. It is calculated by dividing the total number of sessions (or visits) by the total number of unique visitors. Sometimes it is used to measure the loyalty of your audience.
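The calculation described above is a single division, shown here as a small Python sketch. The session and visitor counts are made-up illustration values.

```python
def frequency(sessions: int, unique_visitors: int) -> float:
    """Average sessions per unique visitor over a reporting period."""
    if unique_visitors <= 0:
        raise ValueError("unique_visitors must be positive")
    return sessions / unique_visitors


# Hypothetical month: 1,500 sessions from 500 unique visitors.
print(frequency(1500, 500))  # 3.0
```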
Frequency	The number of times a visitor has visited a site during a reporting period. Average Frequency is the average of frequencies of all the visitors during the reporting period. Frequency is a retention metric and is part of RFM (recency, frequency, monetary) analysis. See also recency and latency.
Friendly Name	A more user-friendly name that is displayed in place of a URL in your SiteCatalyst reports.
FTP	(File Transfer Protocol) The basic method for copying a file from one computer to another through the Internet.
Full Paths	SiteCatalyst report that displays the entire visit path through a web site that visitors most commonly take.
Funnel Analysis	If you engage your customers in a stepwise procedure, you will notice that people drop off at each step. It is your responsibility to make it easier for your customer to accomplish what he or she set out to do. By analyzing and experimenting, you can reap the rewards of increased customer turnover.
Funnel	See Conversion Funnel.
G.U.I	Stands for Graphical User Interface. Means a visual representation of the functional code. Or, is a way for the average web user to interface with a database, program, etc.
gadget (Google)	A miniature web application that can be placed on any page on the web, including your iGoogle page. For more information, see the Gadgets site.
Gantt View	The Gantt view provides a quick view of when your site campaigns began and when they ended (flight date), and how they affected your site's success metrics. You can see the day each campaign began as well as the day the campaign ended.
Gateway page	See Doorway Page.
Gateway	Hardware or software that acts as a mediator between two distinct protocols and helps in the transfer of information between systems.
Gauge Reportlet	Gauge reportlets show the performance of a specific metric according to a custom-defined scale. You have the option to select a dial, bar, or bulb visualization type, set the scale (thresholds) you wish to use, and define other report details.
Geo-Targeting	The geographic location of the searcher. Geo-targeting allows you to specify where your ads will or won't be shown based on the searcher's location, enabling more localized and personalized results.
Geolocation of visitors	With IP geolocation, it is possible to track visitors' locations. Using an IP geolocation database or API, visitors can be geolocated to city, region or country level. IP Intelligence, or Internet Protocol (IP) Intelligence, is a technology that maps the Internet and catalogues IP addresses by parameters such as geographic location (country, region, state, city and postcode), connection type, Internet Service Provider (ISP), proxy information, and more. The first generation of IP Intelligence was referred to as geotargeting or geolocation technology. This information is used by businesses for online audience segmentation in applications such as online advertising, behavioral targeting, content localization (or website localization), digital rights management, personalization, online fraud detection, geographic rights management, localized search, enhanced analytics, global traffic management, and content distribution.
GeoSegmentation	GeoSegmentation reports help you understand the geographic dynamics of your Web audience, including the countries, states and cities from which they are browsing.
GET Method	The GET method is a way of passing parameters of an HTTP request from the browser to the server. This method puts the parameters, usually separated by special characters such as ampersands (&), in the URL itself, which is viewable to the
getAndPersistValue Plug-in	getAndPersistValue is used to force a value to be set in a variable on every page for H days or until the end of the session. A common use is to see how many page views a campaign generates after a click-through, which enables you to easily see the most common pages for each campaign.
getCartOpen Plug-in	This plug-in identifies the first time a product is added to the cart (first scAdd). The Cart Open event can be used to compare carts initialized to carts completed (orders).
getCookieCount Plug-in	Omniture plug-in that determines the total number of cookies users have on your domain.
getCookieSize Plug-in	Omniture plug-in that determines the average size of cookies on your domain.
getDaysSinceLastVisit Plug-in	Omniture plug-in that returns the number of days since the visitor's last visit to your site.
getNewRepeat Plug-in	Omniture plug-in that provides breakdown segmentation of new and repeat visitors.
getPBD Plug-in	Popup Blocker Detection Plug-in. Helps determine what percentage of your users have pop-up windows blocked. This can be evaluated on a user basis or page basis.
getPreviousPage Plug-in	This plug-in will capture the page name the user saw last. Commonly used to correlate to internal search terms.
getQueryParam Plug-in	getQueryParam returns the value of the query string parameter found in the current URL. If no query string parameter is found with that value, an empty string is returned.
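The lookup behaviour described for getQueryParam can be sketched in Python with the standard library. This is only an illustration of the idea, not Omniture's plug-in code; the function name `get_query_param` and the example URL are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def get_query_param(url: str, name: str) -> str:
    """Return the first value of query parameter `name`, or "" if it is absent."""
    query = urlparse(url).query
    # keep_blank_values=True so that "?cid=" is kept as an empty value
    values = parse_qs(query, keep_blank_values=True).get(name, [])
    return values[0] if values else ""


# Hypothetical landing-page URL carrying a campaign ID in "cid"
url = "http://www.example.com/landing?cid=spring_sale&page=2"
print(get_query_param(url, "cid"))      # value of the cid parameter
print(get_query_param(url, "missing"))  # empty string, parameter not present
```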
getTimeToComplete Plug-in	This plug-in evaluates the time it takes a user to complete a process. It is commonly used to determine time between checkout and purchase.
getValOnce Plug-in	getValOnce is used to force a variable to be populated only once within a single session or time period. The most common reason for doing this is to keep campaign click-throughs from being inflated.
getVisitStart Plug-in	Omniture plug-in that determines when a user’s visit on your site begins.
GIF	A graphics file type (Graphics Interchange Format): a compressed, bitmapped format often used on the web because of its good quality/compression ratio when used on certain image types, particularly those with large flat areas of color.
Gigabyte	1024 Megabytes.
GIS	Geographic Information Systems, applications and data related to transport, postcode and geo-demographic information.
Global Report Suite	A report suite that can be used to report on the collective actions of all the children report suites that roll up to it. For example, www.mysite.com (global report suite) can be used to report on the actions of www.mysite1.com, www.mysite2.com, and www.mysite3.com.
GMT (Greenwich Mean Time)	The international time zone standard. GMT is five hours ahead of Eastern Standard Time (EST). For example, 1:00 A.M. EST would be 6:00 A.M. GMT.
Goal Conversion Rate	In the context of Campaign Tracking, the percentage of sessions on a site that result in a conversion goal being reached on that site.
Goal Setting	Identify your main success points, such as customer-driven goals, sales-driven goals, etc. This helps you understand which parts of your website are a success and which parts are failing. With such information you can stop trouble before it comes. For example, keeping a close eye on help metrics and customer support can help you
Googlebot	A program that fetches or crawls billions of pages on the web. Crawling programs are also known as robots, bots, or spiders. Googlebot uses an algorithm that determines which sites to crawl, how often, and how many pages to fetch from each site.
Granularity	The level of detail at which you are viewing your data. Options include daily, weekly, monthly, and yearly.
Graphic User Interface	(GUI) Pronounced gooey. A method of controlling software using on-screen icons, menus, dialog boxes, and objects that can be moved or resized, usually with a pointing device such as a mouse.
Group Membership	Group membership enables a group of users to have access to various functions within SiteCatalyst.
group	A reporting dimension that is typically displayed on the left in a tabular report. For instance, search phrases is a group.
GRP	Gross Rating Point is the percentage of the target audience reached by an ad.
gTLDs	Generic Top Level Domains refer to the part of an internet domain name to the extreme right, which denotes a business (.com), a non-profit organisation (.org), a network provider (.net), or a country code (.uk, .es, .ie etc.). New gTLDs are being added all the time: .gov, .biz, .tv and so on.
H Code	JavaScript code version H used by Omniture in the SiteCatalyst Code to Paste.
Hacker	An individual who intentionally breaks into someone else's computer system or network, to bypass security measures and access private data.
Hardware	Computer machinery and equipment such as hard disks, printers, monitors, network cards etc. See also Software.
HDSL	High-bit-rate Digital Subscriber Line, similar to ADSL, but operating at higher transfer speeds in both directions. See also DSL.
Head Terms	Search terms that are short, popular and straightforward; e.g., helicopter skiing. These short terms are called head terms based on a bell-curve distribution of keyword usage that displays the high numbers of most-used terms at the head end of the bell curve graph. See also Tail Terms.
Header	Identifying information that precedes and describes the attached content. Email messages contain an automatically generated header that describes when the message was sent, to whom, from whom, and which path it took through the Internet.
hexing, hexed, hex	A form of encoding data from a binary to a string representation where each byte is represented by a hexadecimal value in characters. For example: {0,10,20,15,30,40,50} hexed is 000A140F1E2832.
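The worked example in the hexing entry can be reproduced with a short Python sketch. The helper names `hex_encode` and `hex_decode` are hypothetical, chosen for illustration only.

```python
def hex_encode(values):
    """Encode a sequence of byte values (0-255) as an uppercase hex string."""
    return bytes(values).hex().upper()


def hex_decode(text):
    """Reverse of hex_encode: turn a hex string back into a list of byte values."""
    return list(bytes.fromhex(text))


encoded = hex_encode([0, 10, 20, 15, 30, 40, 50])
print(encoded)              # 000A140F1E2832
print(hex_decode(encoded))  # [0, 10, 20, 15, 30, 40, 50]
```

Each byte always becomes exactly two characters, so the hexed string is twice as long as the original data.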
Hidden text	(Also known as Invisible text.) Text that is visible to the search engines but hidden to a user. It is traditionally accomplished by coloring a block of HTML text the same color as the background color of the page. More creative methods have also been employed to create the same effect while making it more difficult for the search engines to detect or filter it. It is primarily used for the purpose of including extra keywords in the page without distorting the aesthetics of the page. Most search engines penalize or ignore URLs from web sites that use this practice.
Hierarchy	A hierarchy (concept name) is a certain group in a web site. Many users group their web site into sections and subsections. The Hierarchies Section in SiteCatalyst then allows them to view reports on these individual sections.
hierN Variable	The hierarchy variable (JavaScript file name) is used to determine the location of a page in your site's hierarchy. This variable is most useful for sites that have more than three levels in the site structure. For example, a media site may have four levels to the Sports section: Sports, Local Sports, Baseball, Red Sox. If someone visits the Baseball page, then Sports, Local Sports and Baseball will all reflect that visit.
Highly Configurable	Sawmill is highly configurable using a large set of configuration options. These options can be configured through the graphical user interface from any web browser. The options let you choose which reports are available (or you can create your own custom reports), what types of information are tracked, which log entries are filtered out, what the statistics look like, and much, much more.
History	A record of a user's progress while using an application. Web browsers track the history of sites visited, pages viewed etc.
Hit Counter	A piece of web server software that racks up the number of file requests relating to a webpage or site.
Hit	The request or retrieval of any item located within a web page. For example, if a user enters a web page with 5 pictures on it, it would be counted as 6 hits. One hit is counted for the web page itself, and another 5 hits count for the pictures.
Hits	A generally misused and not particularly instructive metric, defined as any request for a file, including images on a page. A plain web page with 4 images would generate 5 hits when visited.
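The difference between hits and page views can be sketched in Python. The request list and the rule "a path ending in .html or .htm is a page" are assumptions made for illustration; real log analyzers use their own classification rules.

```python
# Hypothetical requests generated by one visit to a plain page with four images.
requests = [
    "/index.html",
    "/img/logo.gif",
    "/img/banner.jpg",
    "/img/photo1.jpg",
    "/img/photo2.jpg",
]

# Assumption: only these extensions count as pages.
PAGE_EXTENSIONS = (".html", ".htm")


def count_hits(paths):
    """Every requested file is one hit."""
    return len(paths)


def count_page_views(paths):
    """Only requests for page files count as page views."""
    return sum(1 for path in paths if path.endswith(PAGE_EXTENSIONS))


print(count_hits(requests))        # 5 hits: the page plus its 4 images
print(count_page_views(requests))  # 1 page view
```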
Home Page / Homepage	Either the page your web browser automatically loads when you launch it, or the initial welcome page of a website.
Home Page Announcement	A Home Page Announcement is an area of the SiteCatalyst Home Page to which system administrators can post messages and general updates to all users.
Home Page URL	The local path or Internet URL to the default page of the web site for which Webtrends reports will be generated.
Host	A network computer that is a repository for services available to other computers on the network.
Hosting	The activity of housing, serving and maintaining files for one or more internet resources. Hosting is part of the core business of ISPs.
How Accurate is Google Adword Billing?	Many people believe that Google over-charge for Adwords. I investigated this in great depth, including interviews with senior Google staff. The results may surprise you. Google are convinced they are absolutely, always 100% accurate. In fact, they are so accurate they don't even have a mechanism for investigating possible errors; they have a re-education team whose job is to teach you not to complain. Google Adwords Billing Accurate or Arrogant?
How to Assess Website Performance	Describes the framework you need to assess the performance of your website. It provides you with a checklist to ensure you are covering everything. Framework for Performance Assessment
htaccess file	A file with one or more configuration directives placed in a web site document directory. The directives apply to that directory and all sub-directories.
HTML Editors	Word processor-like software applications that are used to design and construct web documents.
HTML	Hyper Text Markup Language is used to write documents for the World Wide Web and to specify hypertext links between related objects and documents.
HTTP (Hypertext Transfer Protocol)	A protocol that determines how Web servers and Web browsers should respond to various commands. For example, when you enter a URL in your browser, your browser sends an HTTP command to the Web server instructing it to retrieve the requested Web page.
HTTP Referrer Data	A program included in most web analytics packages that analyzes and reports the source of traffic to the user's web site. The HTTP referrer allows webmasters, site owners and PPC advertisers to uncover new audiences or sites to target or to calculate conversions and ROI for future ad campaigns.
HTTP	Stands for Hypertext Transfer Protocol.
HTTPS	Stands for Hypertext Transfer Protocol Secure.
Hybrid methods	Some companies are now producing programs that collect data through both logfiles and page tagging. By using a hybrid method, they aim to produce more accurate statistics than either method on its own. The first Hybrid solution was produced in 1998 by Rufus Evison, who then spun the product out to create a company based upon the increased accuracy of hybrid methods[2][5].
Hypertext	Text that contains links to other documents by means of clicking on a hot spot to load and view the related document
IAB	Interactive Advertising Bureau (http://www.iab.net)
ISAPI_rewrite	ISAPI_rewrite is a powerful URL manipulation engine based on regular expressions. It acts mostly like Apache's mod_rewrite, but is designed specifically for Microsoft's Internet Information Server (IIS). ISAPI_rewrite is an ISAPI filter written in pure C/C++ so it is extremely fast. ISAPI_rewrite gives you the freedom to go beyond the standard URL schemes and develop your own scheme.
IE	Internet Explorer. Microsoft's web browser application, currently the market leader.
iframe	An HTML structure that lets you embed an HTML document within an HTML page.
iGoogle	Your personalized Google page. You can add gadgets, news, photos, and other feeds from across the web to your page.
IHCF	Industry Hot Card File. List of lost or stolen cards, available for checking by merchants.
IIS	Microsoft Internet Information Server, or IIS as it's commonly called, is a popular web server software system for Windows operating systems. It is currently unavailable for other operating systems. For more information, see Microsoft.com.
Image Request	An image request, also known as a clear.gif or web beacon, is a transparent graphic image, no larger than 1x1 pixel, usually placed on a web site or in an email to track visitor behavior. SiteCatalyst's web beacon points to a server, known as
Implementation Consultant	The Omniture Implementation Consultant is an Omniture representative that works directly with the customer to implement SiteCatalyst code to ensure that SiteCatalyst is configured properly and successfully.
Implementation	A term used when referring to the setup and configuration of SiteCatalyst.
Impression	A display, on a search engine or other source, of a referral link or advertisement.
Impressions	Number of times your ad is displayed.
Improving Online Sales without Increasing Costs	Case study showing how I quadrupled sales for Motoreasy without increasing their marketing spend. Improving Online Sales without Increasing Costs
Improving website performance	I am one of the world's most published independent web analysts, with clients in the USA and Europe. I analyze the behaviour of visitors in your site. I cross-reference this with your online marketing programs. By combining this with a solid understanding of the psychometrics of web usage, I identify specific changes which will improve the percentage of visitors who do what you want (such as buy stuff). These changes may involve alteration to ad copy, or the prices paid for ads, or they may involve changes to the web site, such as moving components to different points on a page, adding new pages, or changing copy. I then co-ordinate with your design team to make this happen. If your site is merely a brochure to assist a traditional sales channel, you probably don't need this. Site performance improvement is for those companies who want to put their website at the centre of their business strategy. Throughout this process I keep you as involved as you want.
Inaccuracies in Website Measurement	Recording what happens on websites is NEVER 100% accurate. These two articles explain where the errors occur, why and what you can do about it. Inaccuracies in Website Measurement Caused by Internet Technology Inaccuracies in Website Measurement in Software
Inbound/Back Link	A text or graphical hyperlink from one site to another. The algorithms of Google and other search engines judge a site's popularity by the quality and quantity of inbound links from relevant third-party sites, which helps determine search positioning.
Include	Include is a filter type available in the Google Analytics Filters configuration. If an Include filter is applied to a Profile, only those log file lines (hits) that match the Include will be used in the creation of the corresponding Google Analytics reports.
Index	The collection of information (contained in a large database) a search engine has that searchers can query against. With crawler-based search engines, the index is typically copies of all the web pages they have found from crawling the web. With human-powered directories, the index contains the summaries of all web sites that have been categorized.
Indexability	Also known as crawlability and spiderability. Indexability refers to the potential of a web site or its contents to be crawled or indexed by a search engine. If a site is not indexable, or if a site has reduced indexability, it has difficulties getting its URLs included.
Individual	Activity of a single Web visitor for a defined period of time.
Information Security	All-encompassing term that refers to the security of the information systems that are used and the data that is processed.
Initial Session	This is the first Session conducted by a trackable Unique Visitor during the current Date Range. This value is equal to the total number of Unique Visitors during the same Date Range (each Unique Visitor has at least one session). This value is provided in contrast to Repeat Sessions.
inline annotations	Annotations that are embedded inside a context file. For contrast, see external annotations. A context file with inline annotations has two sections: the CustomSearchEngine section, which houses the context or search engine specification, and the Annotations section, which houses the annotations or sites information. You can use files of this format only if you are hosting them from your website.
Insight Variable	See Custom Conversion Insight Variable or Custom Conversion Traffic Variable.
Instance	An instance relates to the number of times that a unique event occurs in SiteCatalyst. In Conversion reports, an instance represents the number of times that a shown value was passed to an eVar. For example, if 10,000 visitors spend between 10 and 15 minutes on a specific page in your website, you would have 10,000 instances of people visiting for that duration. Additionally, SiteCatalyst contains the Instance Metric, which is used in several places. For example, the Referrers Report and Referring Domains Report show the Instance Metric, which reports the number of click-throughs from the referring page or domain.
Integrating Offline Data	The customers for a lot of companies don't exist only in the online world. There are a lot of interactions that take place in the real world, such as payments, package deliveries or telephone interactions. If offline data is not taken into account it
Intelligent Detection Systems	Computer systems developed by the banking industry to help identify fraudulent card use. Also known as knowledge-based systems and neural networks.
Interactive television (ITV)	Interactive Television is television with interactive content and enhancements. Interactive television provides richer entertainment and more information about what is showing. Literally it combines traditional TV watching with the interactivity of the Internet and personal computer.
Interactive TV	Interactive TV gives the viewer more control over their television viewing in a variety of ways, by sending information to the broadcaster, to order a pay-per-view programme, buy or bid for sale items, select a different camera viewpoint, place a bet on a sports event etc.
internal campaign	Internal campaigns are campaigns designed to elicit a response from your web site visitors. You can identify internal campaigns and use them to test the effectiveness of your banners and calls to action (e.g., subscription, promotional sale, request to rate your content, etc.). Internal campaigns do not conflict with regular (external) campaigns.
internal conversion rate	A metric is defined for particular actions, and is calculated by dividing the number of visits that included a particular action by the total number of visits. An alternative way of defining this metric is action participation divided by the number of visits.
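The calculation in the internal conversion rate entry can be written out as a small Python sketch. The function name and the sample figures are hypothetical; returning 0.0 when there are no visits is a design choice made here, not something the definition specifies.

```python
def internal_conversion_rate(visits_with_action: int, total_visits: int) -> float:
    """Visits that included the action divided by total visits."""
    if total_visits == 0:
        # Assumption: report 0.0 rather than failing when there is no traffic.
        return 0.0
    return visits_with_action / total_visits


# Hypothetical figures: 50 of 2,000 visits included a newsletter sign-up.
rate = internal_conversion_rate(50, 2000)
print(f"{rate:.1%}")  # 2.5%
```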
Internal Web Analytics Processes	In many organisations the problem is not the lack of web analytics software, but the organisational structures around them. Does the information flow from those gathering it to those who can action it? In a timely fashion? In a format they can use? Are decisions getting implemented? I understand that technology is useless if it isn't supported by appropriate business processes and structures. I can help you develop your organisation's internal web analytics processes. Obviously, every organisation's needs are unique, so contact us to discuss your requirements.
INTERNATIONAL experience	Most of our clients are global leaders in their respective industries. Having analyzed the complex web analytics needs of huge organizations, we know what it takes to generate actionable insights when there are multiple websites, data streams, email campaigns, social media, and many other offline channels that have an impact on the data.
Internet Average	The average value for a given statistic, taken across thousands of business Web sites. It provides a baseline against which site data is compared. SiteCatalyst uses the Internet Average for all reports in the Technology Section, as well as the Search Engines, Countries, Languages, and Domains Reports.
Internet	The Internet is the publicly accessible global system of interconnected computer networks that transmit data via a standardized Internet Protocol. See also World Wide Web.
IP (Internet Protocol) Address	This is a unique string of numbers that identifies a computer or server on the Internet. These numbers are normally shown in groups separated by periods, for example, 123.45.67.255. Hosting accounts for websites can have either a shared or unique IP address.
IP address jinx	Unlike traditional web analysis where IP address may help you track down the location and identification of the web users, most IP addresses for mobile internet users will show you the internet gateway address of the mobile operator.
IP Address Lookup	The process of determining a unique Internet Protocol (IP) address. DNS stuff is one free program to look up an IP address (www.dnsstuff.com).
IP Address	An identifier for a computer or device on a TCP/IP network. Networks using the TCP/IP protocol route messages based on the IP address of the destination. The format of an IP address is a numeric address written as four numbers separated by periods. Each number ranges from 0 to 255.
IP	Internet Protocol is a standard used for communicating data.
IPAddress	Dedicated and shared IPs. (An IP address is) an identifier for a computer or device on a TCP/IP network. Networks using the TCP/IP protocol route messages based on the IP address of the destination. The format of an IP address is a 32-bit numeric address, written as four numbers separated by periods. Each number can be zero to 255. For example, 1.160.10.240 could be an IP address. Source: Webopedia. (Added definition) An IP Address can be dedicated for one web site or shared by multiple web sites.
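The "32-bit numeric address, written as four numbers" format can be illustrated in Python. The function `ipv4_to_int` is a hypothetical helper written for this sketch; the standard-library `ipaddress` module performs the same conversion and is used as a cross-check.

```python
import ipaddress


def ipv4_to_int(address: str) -> int:
    """Pack the four dotted numbers (0-255 each) into one 32-bit integer."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= octet <= 255 for octet in octets):
        raise ValueError(f"not a valid IPv4 address: {address!r}")
    value = 0
    for octet in octets:
        # Shift the running value left by one byte and append the next number.
        value = (value << 8) | octet
    return value


example = "1.160.10.240"  # the example address from the definition above
print(ipv4_to_int(example))
print(ipv4_to_int(example) == int(ipaddress.IPv4Address(example)))  # True
```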
IPTV	Acronym for Internet Protocol Television, which delivers digital television service using the Internet Protocol over a network. IPTV delivery may be through a high capacity, high speed broadband connection. Compared to traditional broadcast and cable television, IPTV may offer new venues for PPC search advertisers through program interfaces and stored individual preferences. Source: Wikipedia
IRC	Internet Relay Chat. A system whereby users can connect via the internet to an IRC server to exchange real-time text messages and files.
ISDN	Integrated Services Digital Network. An early method of dial-up digital network access. ISDN provides two 64 kbps data channels and a 16 kbps control channel for voice and data communications over a telephone line.
ISP (Internet Service Provider)	Companies that provide end-user Internet access via dial-up, cable modem or other accounts. Some ISPs also provide other services such as Web site hosting, domain name registration, etc.
ISP	Internet Service Provider. A company which provides other companies or individuals with access to, or presence on, the Internet. Most ISPs are also Internet Access Providers; extra services include help with design, creation and administration of
ISPA	The UK Association of Internet Service Providers.
iTV	See Interactive TV.
Java support	Whether your visitors' browsers support the Java environment or not.
Java	An object-oriented programming language invented by Sun Microsystems. Java is designed to run on any type of computer hardware through an intermediary layer called a virtual machine, which translates Java instructions into native code for that particular computer.
JavaScript / Java	JavaScript is a scripting language, originally developed at Netscape, used extensively in web applications to do relatively advanced things on webpages, for example changing the colour of an image whenever the cursor moves over it. Despite the similar name, it is a separate language from Java, which was developed by Sun Microsystems.
JavaScript Debugger	The JavaScript-based debugger is a SiteCatalyst tool that allows you to view the parameters sent in an image request.
JavaScript status	Whether JavaScript is enabled on your visitors' browsers or not.
JavaScript tag	See page tag.
JavaScript version	The versions of JavaScript that your visitors use.
JavaScript	JavaScript is a scripting language based on prototype-based programming. It is used on a web site as client-side JavaScript, and also to enable scripting access to objects in other applications.
JDK	Java Development Kit. A software development package that implements the basic set of tools needed to write, test and debug Java applications and applets.
JPEG	A Joint Photographic Experts Group file format is a commonly used file type for photographic images, especially on the web.
JSON	JavaScript Object Notation. A format for exchanging data that are structured as a collection of name-value pairs or an ordered list of values. To learn more, see the JSON.org documentation.
KEI	Keyword Effectiveness Index. The higher the KEI, the more popular the keywords are and the less competition they have, which means they have a better chance of getting to the top of organic searches. Various different formulae are used to calculate KEI, although the most common is KEI = S²/C, where S is the number of Searches for that keyword in a given period and C is the number of individual webpages Competing for that keyword (i.e. the number of pages returned when searching for the keyword on a nominated Search Engine).
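The most common KEI formula named above, searches squared divided by competing pages, can be sketched in Python. The function name and the sample numbers are hypothetical, and other KEI formulae exist, as the entry notes.

```python
def keyword_effectiveness_index(searches: int, competing_pages: int) -> float:
    """KEI = S^2 / C, with S = searches and C = competing pages."""
    if competing_pages <= 0:
        raise ValueError("competing_pages must be positive")
    return searches ** 2 / competing_pages


# Hypothetical keywords with the same search volume but different competition.
niche = keyword_effectiveness_index(searches=1000, competing_pages=50_000)
crowded = keyword_effectiveness_index(searches=1000, competing_pages=5_000_000)
print(niche, crowded)  # 20.0 0.2
```

Squaring S rewards popular keywords strongly, while dividing by C penalises crowded ones, which matches the "more popular, less competition" reading in the definition.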
Key Performance Indicator (KPI)	A key performance indicator (KPI) is a measurement that reflects the success elements of your business and varies depending on the organization. For example, an important key performance indicator for an online shop might be the number of sales. A KPI for an online help desk might be the number of customers assisted per month. According to the Web Analytics Association, a KPI can be either a count or a ratio, though it is frequently a ratio. While basic counts and ratios can be used by all website types, a KPI is infused with business strategy (hence the term "Key"), and therefore the set of appropriate KPIs typically differs between site and process types. KPIs are the crucial parameters showing the health of the website and the success of marketing strategies.
Keyword / Keyword Phrase	A specific word or combination of words that a searcher might type into a search field. Includes generic, category keywords; industry-specific terms; product brands; common misspellings and expanded variations (called Keyword Stemming), or multiple words (called Long Tail for their lower CTRs but sometimes better conversion rates). All might be entered as a search query. For example, someone looking to buy coffee mugs might use the keyword phrase ceramic coffee mugs. Also, keywords which trigger ad network and contextual network ad serves are the auction components on which PPC advertisers bid for all Ad Groups/Orders and campaigns.
Keyword Density	The number of times a keyword or keyword phrase is used in the body of a page, expressed as a percentage of the total number of words on the page rather than as a raw count. In general, the more often a keyword appears on a page, the higher its density. However, there is a natural limit to this: one cannot add a keyword to a page too many times, as the page will stop reading naturally.
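One way to read the keyword density definition is "occurrences of the keyword divided by total words, as a percentage". The Python sketch below follows that reading; it handles single-word keywords only, and the tokenisation rule and sample text are assumptions made for illustration.

```python
import re


def keyword_density(text: str, keyword: str) -> float:
    """Percentage of words in `text` that equal `keyword` (case-insensitive)."""
    # Assumption: a "word" is a run of letters, digits or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    occurrences = words.count(keyword.lower())
    return 100.0 * occurrences / len(words)


# Hypothetical page copy: 10 words, 3 of which are "mugs".
page = "Ceramic mugs and coffee mugs: all our mugs are handmade."
print(keyword_density(page, "mugs"))  # 30.0
```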
Keyword Stemming	To return to the root or stem of a word and build additional words by adding a prefix or suffix, or using pluralization. The word can expand in either direction and even add words, increasing the number of variable options.
Keyword Stuffing	Generally refers to the act of adding an inordinate number of keyword terms into the HTML or tags of a web page.
Keyword Tag	Refers to the META keywords tag within a web page. This tag is meant to hold approximately 8 to 10 keywords or keyword phrases, separated by commas. These phrases should be either misspellings of the main page topic, or terms that directly reflect the content on the page on which they appear. Keyword tags are sometimes used for internal search results as well as viewed by search engines.
Keyword Targeting	Displaying Pay Per Click search ads on publisher sites across the Web (see also Contextual Networks) that contain the keywords in a context advertiser's Ad Group.
Keyword	A keyword is a database index entry that identifies a specific record or document. Keyword searching is the most common form of text search on the web. Most search engines do their text query and retrieval using keywords. Unless the author of the web document specifies the keywords for her document (this is possible by using meta tags), it's up to the search engine to determine them. Essentially, this means that search engines pull out and index words that are believed to be significant. Words that are mentioned towards the top of a document and words that are repeated several times throughout the document are more likely to be deemed important.
Kilobyte	1024 bytes.
Knowledge Base	The Knowledge Base is a SiteCatalyst tool that contains frequently asked questions about Omniture products and general web analytics questions. All SiteCatalyst users can access the Knowledge Base.
KPI	Key performance indicator defined by user that is important in determining the success of an initiative.
L-Commerce	Location-based commerce, relating to business conducted at a physical location.
label	Determines how sites should be treated and gives you the means to fine-tune search results. A label has a mode which tells Custom Search what to do with the sites, such as whether to exclude, promote, or demote them. How much a site is promoted or demoted depends on the weights that you apply to the labels. The two kinds of labels are search engine labels, which run in the background, and refinement labels, which are exposed to your users. To learn more, see Changing the Ranking of Your Search Results and Helping Your Users Refine Their Searches.
LAN	Local Area Network. A computer network limited to the immediate area, usually contained in the same building.
Landing Page / Destination Page	The web page at which a searcher arrives after clicking on an ad. When creating a PPC ad, the advertiser displays a URL (and specifies the exact page URL in the code) on which the searcher will land after clicking on an ad in the SERP. Landing pages are also known as where the deal is closed, as it is landing page actions that determine an advertiser's conversion rate success.
Landing page	The first page that a user views during a session. This is also known as the entrance page.
Landing Page	The first viewed page on a visitor's path through a site. Often, the website's home page will be the page that most visitors land on, but specially built landing pages for advertising campaigns can be more effective in helping the conversion process. Landing pages can be stand-alone with no connections to your main website. They can also be specialised micro-sites that are focussed on a particular audience and desired outcome. See also Conversion and Conversion Funnel.
languages	The languages used by your visitors, based on the language set up in their operating system.
Last 100 Visitors	SiteCatalyst report that gives the IP address and domain of the last 100 visitors to your site.
Last Run	This is the time the task in question last ran, whether successfully or not. As soon as the same task is run again, this value will change to the new start time.
Latency	The delay between a request being made and the response starting to arrive, for example the time between a browser requesting a page and receiving the first data back from the server.
Latent Semantic Indexing	LSI uses word associations to help search engines know more accurately what a page is about.
LATEST and greatest in web analytics	Just like the constantly changing online landscape, our services and analytics frameworks are always refined to offer you the most innovative way to measure your online performance. With customized services like campaign analytics, social media analytics, social media campaign analytics, mobile analytics, A/B and multivariate testing, and custom analytics, our clients are able to discover new opportunities.
Lead Generation	Web sites that generate leads for products or services offered by another company. On a lead generation site, the visitor is unable to make a purchase but will fill out a contact form in order to get more information about the product or service presented. A submitted contact form is considered a lead. It contains personal information about a visitor who has some degree of interest in a product or service.
Learning At Nabler	Nabler acts as a constant platform for learning. We believe learning is a two-way process; it's not only the person taught who learns from the process, the teacher learns a lesson or two as well. Gaining knowledge is the most important job responsibility
Leased Line	A high-speed data line that is rented for exclusive 24/7 use between fixed locations.
Legal Compliance	These days websites must meet a number of legal regulations, especially with regard to disability access and privacy. For example, websites which sell goods online are required under EU law to be fully disability accessible. Regarding privacy, the EU privacy directive states that, as of May 25, 2011, websites may not set any cookies without the express prior permission of their visitors; a privacy policy does not constitute sufficient consent. We can check your website for legal compliance, and provide concrete technical solutions where it fails.
Lifetime Value	The total amount of a given success metric for a single user, for example, the total number of lifetime visits for a user.
Line Item	A row of data in a SiteCatalyst report.
Linear Allocation	Credit for a success event is allocated evenly among the values populated in the eVar. For example, a visitor uses three different keywords at different times, and the resulting purchase is $9.00. Each keyword is allocated 1/3 or $3 credit for the purchase.
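The arithmetic in the Linear Allocation entry can be sketched in a few lines of Python. This is only an illustration of the even split described above, not SiteCatalyst code; the function name `linear_allocation` and the keyword labels are hypothetical.

```python
from fractions import Fraction


def linear_allocation(values, amount):
    """Split `amount` evenly across the distinct values that preceded a success event."""
    distinct = list(dict.fromkeys(values))  # drop repeats, keep first-seen order
    if not distinct:
        return {}
    share = Fraction(amount) / len(distinct)
    return {value: share for value in distinct}


# Three keywords used before a $9.00 purchase (hypothetical labels).
credit = linear_allocation(["keyword a", "keyword b", "keyword c"], 9)
for keyword, share in credit.items():
    print(f"{keyword}: ${float(share):.2f}")
```

Using `Fraction` keeps the shares exact, so they always add back up to the original amount even when the split is not a round number.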
Link Building	In order to leave your footprint on the internet, you need to be popular on the web. Link building is a strategy wherein your website gets references from other websites. The intention is to create numerous inbound links to your website. Search engines are fond of links and identify innumerable links as a sign of importance of your website. Outsource2india can publicize your website with a powerful link building strategy.
Link Cardinality	See Link Popularity.
Link Farming	The attempt to substantially and artificially increase link popularity.
Link Popularity	Link popularity generally refers to the total number of links pointing to any particular URL. There are typically two types of link popularity: Internal and External. Internal link popularity typically refers to the number of links or pages within a web site that link to a specific URL. External link popularity refers to the number of inbound links from external web sites that are pointing to a specific URL. If you have more links than your competitors, you are typically known to have link cardinality or link superiority.
Link Tracking	The method of tracking links as they pertain to ClickMap and other custom link code executions; i.e. variables and events.
Link	A hot-spot on a webpage, indicated by your cursor changing shape. Clicking a link will take you to another webpage.
Linkbait	Also known as link bait, this is something on your site that people will notice and link to. By linking to your site, other sites are saying they value the content of your site and that they think other people will be interested in it, too.
linkDownloadFileTypes Variable	The linkDownloadFileTypes variable is a comma-separated list of file extensions. If your site contains links to files with any of these extensions, the URLs of these links will appear in the File Downloads Report.
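As a rough illustration of how a comma-separated extension list like linkDownloadFileTypes could be matched against link URLs, here is a minimal Python sketch. It is not the SiteCatalyst implementation; the function name `is_download_link` and the example URLs are assumptions made for this example.

```python
from urllib.parse import urlparse


def is_download_link(url, link_download_file_types):
    """Return True if the URL path ends in one of the listed file extensions."""
    extensions = {
        ext.strip().lower()
        for ext in link_download_file_types.split(",")
        if ext.strip()
    }
    path = urlparse(url).path           # ignore the query string and fragment
    _, dot, ext = path.rpartition(".")  # text after the last dot, if any
    return bool(dot) and ext.lower() in extensions


file_types = "exe,zip,pdf"  # hypothetical setting
print(is_download_link("http://example.com/files/report.PDF?x=1", file_types))
print(is_download_link("http://example.com/page.html", file_types))
```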
linked custom search engine	A custom search engine whose XML definition is hosted on a website and not by Google.
linkExternalFilters Variable I	If your site contains many links to external sites, and you don't want to track ALL exit links, linkExternalFilters can be used to report on a specific subset of exit links.
linkHandler Suite Plug-in	The linkHandler suite of plug-ins allows for easy detection of file download and exit links. It also provides insight into clicks to specific links without modifying the existing link code.
Linking Profile	A profile is a representation of the extent to which something exhibits various characteristics. A linking profile is the result of an analysis of where your links are coming from.
linkInternalFilters Variable	linkInternalFilters is an optional variable used in conjunction with linkExternalFilters to determine whether a link is an exit link.
linkLeaveQueryString Variable	The linkLeaveQueryString variable determines whether or not the query string should be included on Exit Links and Download Links.
linkName Variable	linkName is an optional variable used in Link Tracking that determines the name of a custom, download, or exit link.
linkTrackEvents Variable	linkTrackEvents is a comma-separated list of events that may be sent with a custom, exit, or download link. If an event is not in linkTrackEvents, then it will not be sent to SiteCatalyst, even if it is populated in the onClick event of a link.
linkType Variable	linkType is an optional variable used in link tracking that determines which report a Link Name or URL will appear in (Custom, Download or Exit Links).
LINX
Live Reports & Graphs	Sawmill statistics are live, for unparalleled flexibility while viewing the statistics. Sawmill shows you a collection of interlinked web pages which allow rapid navigation of the entire range of your log statistics. Convenient links and menus right on the statistics pages let you zoom in, set up real-time filters, show and hide columns of the tables and other view elements, sort the data however you want, and much more.
Live Support	Technology that allows businesses to communicate in real time with visitors to their website. Live support applications are commonly used to provide telephone or chat support and information to customers.
Load Balancer	Any device or software package that balances the load of incoming or outgoing items to maintain an even server load distribution. For the Web, a load balancer is a computer that sits between the Internet and two or more Web servers and processes the incoming requests to balance the load to each.
Local Area Network	(LAN) A more-or-less self-contained network of interconnected computers (that may connect to the Internet), usually in a single office or building.
Local Search	Search engine results constrained by region/location, based on the searcher's location or intent. With the addition of Web 2.0 capabilities, local search results may include business ratings, reviews, maps and driving directions.
Location bar	The box at the top of your browser window where you type in the address of a website.
Log Analysis VS Page-based Tracking	Two methods exist for gathering stats about your site. Why use page-based tracking for your website analysis rather than analysing your server's log files? What difference does it make?
Log File Analysis	The analysis of records stored in the log file. In its raw format, the data in the log files can be hard to read and overwhelming. There are numerous log file analyzers that convert log file data into user-friendly charts and graphs. A good analyzer is generally considered an essential tool in SEO because it can show search engine statistics such as the number of visitors received from each search engine, the keywords each visitor used to find the site, visits by search engine spiders, etc.
Log file	A file created by a web or proxy server which contains all of the access information regarding the activity on that server. Each line in a log file generated by web server software is a hit, or request for a file. Therefore, the number of lines in a log file will be equal to the number of hits in the file, not counting any field definitions line(s) that may be present.
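The rule that hits equal log lines minus any field-definition lines can be sketched as follows. This is a minimal sketch that assumes field-definition lines start with `#`, as they do in W3C-style logs; the sample lines and the name `count_hits` are made up for illustration.

```python
def count_hits(log_lines):
    """Count request lines, skipping blank lines and '#' directive lines."""
    return sum(
        1
        for line in log_lines
        if line.strip() and not line.startswith("#")
    )


# Hypothetical log excerpt: one field-definition line and three requests.
sample = [
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "2013-07-25 10:00:01 GET /index.html 200",
    "2013-07-25 10:00:02 GET /logo.png 200",
    "2013-07-25 10:00:05 GET /about.html 404",
]
print(count_hits(sample))
```

Note that the page and its logo image count as separate hits, which is why hits run higher than page views.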
Log File	A file created by a web or proxy server which contains all of the access information regarding the activity on that server.
Log Files	A text file created in the server capturing all activity on the website. This file is the primary source of data for analysis.
Log Format	Every log file is written in a particular format. The major access log types are NCSA Extended Combined, which is commonly used by Apache; and W3C, which is commonly used by Microsoft IIS.
Log Rotation	Log Rotation is the practice of renaming a log file, often by adding a date-stamp, and storing it somewhere. This is done concurrently with creating a new log file for the storage of website usage data. Most log rotation is done on a daily basis.
Logfile analysis vs page tagging	Both logfile analysis programs and page tagging solutions are readily available to companies that wish to perform web analytics. In some cases, the same web analytics company will offer both approaches. The question then arises of which method a company should choose. There are advantages and disadvantages to each approach[3].
Login	A predefined account name, or the action of entering an account name, that accesses a private computer system. Logging in usually also requires the input of a valid password.
Long Tail	Keyword phrases with at least three, sometimes four or five, words in them. These long tail keywords are usually highly specific and draw lower traffic than shorter, more competitive keyword phrases, which is why they are also cheaper. Often times, long tail keywords, in aggregate, have good conversion ratios for the low number of click-throughs they generate.
Long-tailed Keywords	Keyword phrases with at least 2 or 3 words in them.
Longest Paths	SiteCatalyst report that displays the longest paths taken by visitors to your site during the current time period. You can view the complete path, including each page visited from beginning to end, for the longest paths.
LTV	Long-Term Value or Life-Time Value. Life-Time Value is a metric used to describe the value a specific customer has over the life of their relationship with you.
M-Commerce	The equivalent of e-commerce in the mobile environment.
Mac	An abbreviation for Apple Macintosh or Apple Mac. A popular type of computer with an easy-to-use graphical interface. Macs are often used by designers, video editors and music producers.
Mail Server	The hardware and software used by your Internet Service Provider to send and receive your email.
Mailbox	The place where your messages are saved by your ISP until you decide to download them. When they arrive on your computer, they are downloaded to your inbox.
Mailing List	Subject-based forums whose messages are distributed by email. You send your email address to a central point from which broadcast messages are received.
MapMyLead	This tool allows sales people to generate more leads from their websites. Most of the B2B sites have a conversion ratio of less than 2%. That means, you dont know who the remaining 98% is. MapMyLead helps our customers to know which companies are coming to their website and what they are doing. MapMyLead comes with advanced querying capabilities as well as a convenient scheduling feature to get data emailed on daily or weekly basis.
Marketing Performance Management (MPM)	Marketing Performance Management drives stronger customer relationships and higher lifetime value, based on a framework of established goals, consistent metrics, constant optimization across the entire marketing organization and across every customer touch point.
Marketing Warehouse	Part of Webtrends Marketing Lab, Marketing Warehouse complements Webtrends Analytics 8 by providing customer segmentation and business intelligence to fuel relationship marketing campaigns and optimize your marketing spend. See also data warehouse.
Marketing	A means of making a communication about a product or service so as to encourage recipients of the communication to purchase or use the product or service.
Medium (Campaign Tracking)	In the context of campaign tracking, medium indicates the means by which a visitor to a site received the link to that site. Examples of mediums are organic and cost-per-click in the case of search engine links, and email and print in the case of newsletters. The UTM variable for medium is utm_medium.
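To show where the medium comes from in practice, the following Python sketch reads the utm_medium parameter from a landing URL. The helper name `campaign_medium` and the example URLs are assumptions for illustration; analytics tools do this extraction themselves.

```python
from urllib.parse import parse_qs, urlparse


def campaign_medium(url):
    """Return the utm_medium value from a URL, or None if it is absent."""
    params = parse_qs(urlparse(url).query)
    values = params.get("utm_medium")
    return values[0] if values else None


# Hypothetical newsletter link tagged with campaign parameters.
tagged = "http://example.com/?utm_source=newsletter&utm_medium=email"
print(campaign_medium(tagged))
print(campaign_medium("http://example.com/"))
```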
Megabyte	1024 kilobytes.
member ID	The member ID of your registered visitors.
Merchant Account	A bank account required by a shop to receive payments through electronic media such as credit cards. It can be considered a virtual bank account in that it receives electronic money.
Merchant ID	A number issued to a merchant by their acquiring bank. The bank uses the Merchant ID to identify a specific merchant in credit card transactions. Each store must have its own Merchant ID.
Merchant	A person or enterprise that buys and sells products and services to businesses or consumers.
Message Board	SiteCatalyst tool that enables multiple users to create or respond to discussion threads about various SiteCatalyst-related topics.
Meta Description Tag	Allows page authors to say how they would like their pages described when listed by search engines. Not all search engines use the tag.
Meta Feeds	Ad networks that pull advertiser listings from other providers. They may or may not have their own distribution and advertiser networks.
Meta Keywords Tag	Allows page authors to add text to a page to help with the search engine ranking process. Not all search engines use the tag.
META Refresh redirect	A client-side redirect.
Meta Robots Tag	Allows page authors to keep their web pages from being indexed by search engines, especially helpful for those who cannot create robots.txt files. The Robots Exclusion page provides official details.
Meta Search Engine	A search engine that gets listings from two or more other search engines, rather than through its own efforts.
Meta Tag	A special HTML tag that provides information about a web page. Unlike normal HTML tags, meta tags do not affect how the page is displayed. Instead, they provide information such as who created the page, how often it is updated, what the page is
Metadata	Data that describes data. Metadata describes how and when and by whom a particular set of data was collected, and how the data is formatted. Metadata is essential for understanding information stored in information repositories.
metric	A metric is any of the numbers, percentages, and ratios available in the reporting interface. Metrics indicate the performance of the reporting groups (i.e., reporting dimensions). A metric can be a number, a percentage, or a ratio. Note that not all metrics are compatible with all groups. Due to the manner in which analytics data is collected and cross-referenced across our system, in some cases data pertaining to one aspect of your tracked property may not be combined with all the metrics dedicated to a different analytics section (e.g., merchandising products groups might not be compatible with all campaign metrics).
microdata	An HTML5 specification for adding machine-readable content to web pages. A microdata item is created by adding an itemscope attribute to an HTML tag, and properties are added to it by adding an itemprop="name" attribute to one of the item's descendants. Microdata can often be used in place of microformats or RDFa. To learn more about microdata, see the HTML5 standard for microdata.
microformats	Specifications for representing commonly published things such as reviews, people, products, and businesses. Generally, microformats consist of <span> and <div> elements and a class property, along with a brief and descriptive property
MIME	Multipurpose Internet Mail Extensions. A standard for attaching non-text files to standard internet mail messages. Non-text files include graphics, spreadsheets, formatted word-processor documents, sound files, etc.
Minimum Bid	The least amount that an advertiser can bid for a keyword or keyword phrase and still be active on the search ad network. This amount can range from $0.01 to $0.50 (or more for highly competitive keywords), and is set by the search engine.
Mobile Search	An evolving branch of information retrieval services that is centered on the convergence of mobile platforms and mobile handsets or other mobile devices. The services allow users to find mobile content interactively on mobile websites, and mobile content shows a media shift toward mobile multimedia.
Mod_rewrite	URL Rewrite processes, also known as mod rewrites, are employed when a webmaster decides to reorganize a current web site, either for the benefit of better user experience with a new directory structure or to clean up URLs which are difficult for search engines to index.
mode	A label attribute that allows you to promote, demote, or remove results matching your label. The modes are: filter, boost, and eliminate.
MODEM	Modulator-Demodulator. Hardware required to access computer networks via a telephone line. The modem converts digital data into analogue audio pulses that travel across telephone networks, and vice versa.
Monitor Color Depths	SiteCatalyst report that shows data on which color-depth the visitors to your web site have their computers set to. Color Depth refers to the number of colors that can be displayed on the screen.
Monitor Resolutions	SiteCatalyst metric that reports the monitor resolutions of visitors to your web site.
month of year	The selected month of the year.
Monthly Unique Visitor	The number of unduplicated (counted only once) visitors to your website over the course of a single month.
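Counting "unduplicated" visitors amounts to collecting visitor identifiers into a set per month. The sketch below assumes each visit is available as a (visitor ID, date) pair; that record shape, the sample data, and the name `monthly_unique_visitors` are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


def monthly_unique_visitors(visits):
    """Map (year, month) to the number of distinct visitor IDs seen that month."""
    seen = defaultdict(set)
    for visitor_id, day in visits:
        seen[(day.year, day.month)].add(visitor_id)
    return {month: len(ids) for month, ids in seen.items()}


# Hypothetical visits: visitor "a" returns twice in March but counts once.
visits = [
    ("a", date(2016, 3, 1)),
    ("a", date(2016, 3, 2)),
    ("b", date(2016, 3, 15)),
    ("a", date(2016, 4, 1)),
]
print(monthly_unique_visitors(visits))
```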
Most Popular Pages	SiteCatalyst report that lists all of the pages of your web site that are being tracked by SiteCatalyst, and tells you which pages are being visited the most.
Most Popular Servers	SiteCatalyst report that lists all of the servers of your web site that are being tracked by SiteCatalyst and tells you which servers are being accessed the most.
Most Popular Site Sections	SiteCatalyst report that lists all the site sections of your web site that are being tracked by SiteCatalyst and tells you which site sections are being visited the most.
MP3	MPEG-1 Audio Layer-3. A standard technology and format for compression of audio information into a smaller, digital file while preserving the original quality and level of sound.
MPM	Marketing Performance Management. Drives stronger customer relationships and higher lifetime value, based on a framework of established goals, consistent metrics, constant optimization across the entire marketing organization and across every customer touch point.
Multi-currency	Term used to define the Omniture process of translating any currency values passed in during the purchase event to the default report-suite currency.
multi-site rollup	A report that lets you track several projects in your account, allowing you to view, via the dashboards, some basic metrics across all those projects. The basic metrics include visits, page views, monthly unique visitors, number of sales, revenue, average order value, and average time per visit.
Multi-Site Rollup	See Rollup.
Multi-Suite Tagging	The ability to send data to multiple report suites using a single image request.
Multihome	A multihome, or load balanced, network means distributing processing and communications activity evenly across a computer network so that no single device is overwhelmed. Load balancing is especially important for networks where it's difficult to predict the number of requests that will be issued to a server. Busy websites typically employ two or more web servers in a load balancing scheme. If one server starts to get swamped, requests are forwarded to another server with more capacity.
Multivariate Testing Framework	See Multivariate Testing
Multivariate Testing	A type of testing that varies and tests more than one or two campaign elements at a time to determine the best performing elements and combinations. Multivariate testing can gather significant results on many different components of, for example, alternative PPC ad titles or descriptions in a short period of time. Often it requires special expertise to analyze complex statistical results. (Compare to A/B Testing which changes only one element at a time, alternately serving an old version ad and a changed one.) In search advertising, you might do A/B Split or Multivariate testing to learn what parts of a landing page (background color, title, headline, fill in forms, design, images) produce higher conversions and are more cost effective.
N-dimensional	Unlimited dimensions.
Naked Links	A posted and visible link in the text of a web page that directs to a web site.
Natural Search Keywords	Keywords that generate non-paid search results in a search engine. Depending on the search engine, Natural Search Keywords display neither at the top nor at the right-side column of the search results. See Paid Search Keywords.
Natural Search Page Ranking	The pages of natural search results shown on search engines. For example, if you search for avon renew on Google, a page of results is shown (in Google, they show 10 results per page by default and as recommended for search speed). This is page one. You can scroll to the bottom and then click to view the next page of results from Google, which would be page two.
Navigation	Describes the movement of a user through a website or other application interface. This term also indicates the system of available links and buttons that the user can use to navigate through the website.
NCSA	NCSA stands for the National Center for Supercomputing Applications. The NCSA developed several important web protocols and software systems, including the standard logging type used by Apache, NCSA Extended Combined.
Negative Keywords	Filtered-out keywords to prevent ad serves on them in order to avoid irrelevant click-through charges on, for example, products that you do not sell, or to refine and narrow the targeting of your Ad Groups keywords. Microsoft adCenter calls them excluded keywords. Formatting negative keywords varies by search engine; but they are usually designated with a minus sign.
Netscape Plug-ins	This SiteCatalyst report lets you learn which plug-ins are used by the visitors to your site (available only for Netscape). This information is especially useful if your web page has content that requires a specific plug-in.
Network	A group of two or more computer systems linked together. There are many types of computer networks, including: Local-Area Networks (LANs), where the computers are geographically close together (in the same building). Wide-Area Networks (WANs), where
New Methods of Web Analysis	Some Polish academics contacted me with radical ideas for much more accurate methods of analysing websites. This article proposes a paradigm shift in how we see web analysis. At the time this generated a great deal of discussion, but nothing came of it. I suspect this is because it would have made all existing systems obsolete. I am convinced that one day this is what we'll do.
New visitor	Google Analytics records a visitor as new when any page on your site has been accessed for the first time by a web browser. This is accomplished by setting a first-party cookie on that browser. Thus, new visitors are not identified by the personal information they provide on your site, but are rather uniquely identified by the web browser they used.
Newsletter Campaigns	Newsletters are a novel way of keeping in touch with your target audience. Outsource2india can help you put together effective newsletters on a regular basis.
Next Page	SiteCatalyst report that provides detailed site path analysis by pinpointing where your visitors go within your site after leaving any given page on your site. For example, if you want to find where your visitors go after your home page, this report will show you the top five pages your visitors go to after leaving your home page.
Next Page Flow	SiteCatalyst report that graphically illustrates two levels of the most popular next pages that your visitors view following the selected page. The report also highlights when visitors exit your site.
NIC	Network Interface Card. A piece of hardware which plugs into a computer and provides a network interface to the appropriate protocol, usually Ethernet.
NNTP	Network News Transfer Protocol. A protocol used to transfer information to and from newsgroups.
No Follow	No Follow is an attribute webmasters can place on links that tells search engines not to count the link as a vote or to pass any trust to that site. Search engines will follow the link, yet it will not influence search results. No Follow can be added to any link with this code: rel="nofollow".
No Referral	The (no referral) entry appears in various Referrals reports in the cases when the visitor to the site got there by typing the URL directly into the browser window or using a bookmark/favorite. In other words, the visitor did not click on a link to get to the site, so there was no referral, technically speaking.
No Script Tag	The noscript element is used to define alternate content (text) if a script is NOT executed. This tag is used for a browser that recognizes the <script> tag, but does not support the script in it.
Node	A single computer connected to a network.
Non-referrals	Visitors who arrive at a site by typing a domain into an address bar, using a bookmark, or clicking on an emailed URL. See also referrals.
OCR	Organic Click Rate. See also PPC.
ODBC	Open Database Connectivity. This interface standard provides a common application programming interface (API) for accessing databases. This gives users access to data that is created with other software. For example, using Webtrends ODBC connectivity, you can export Webtrends Report data into a Microsoft Access database.
Offline	The state of being disconnected from a network.
on-demand indexing	A feature that lets you have selected pages on your website indexed within 24 hours. To learn more, see Getting Started.
On-demand service	Webtrends Marketing Lab is available as both an on-demand service and a software solution.
On-site web analytics technologies	Many different vendors provide on-site web analytics software and services. There are two main technological approaches to collecting the data. The first method, logfile analysis, reads the logfiles in which the web server records all its transactions. The second method, page tagging, uses JavaScript on each page to notify a third-party server when a page is rendered by a web browser. Both collect data that can be processed to produce web traffic reports. In addition, other data sources may also be added to augment the data. For example: e-mail response rates, direct mail campaign data, sales and lead information, user performance data such as click heat mapping, or other custom metrics as needed.
Online Reputation Management (ORM)	The act of monitoring, addressing or mitigating undesirable search engine results or mentions in online media for a company or product. Techniques include generating new content and creating posts on existing content.
Online	Connected to a network.
OpenSearch	A collection of technologies that allow publishing of search results in a format suitable for syndication and aggregation. It is a way for websites and search engines to publish search results in a standard and accessible format. Originally developed
OpenSocial	A common API for social applications across multiple websites. With standard JavaScript and HTML, developers can create applications that access a social network's friends and update feeds. For more information, visit the OpenSocial site.
operating system versions	The versions of the operating systems your visitors use.
OPML	Outline Processor Markup Language. OPML is a type of XML format that was originally developed for defining ordered lists of elements or outlines, but it is now also commonly used for web feeds. You can learn more about OPML by reading its documentation. As for learning more about how it is used in Custom Search, see Selecting Sites to Search.
Opt-in	This permission-based email communication requires customers to verify the opt-in method before their e-mail addresses can be used to communicate with them.
Optimization Testing	Testing is integral to the running of a successful website, as without experimenting, no one can tell what is really working.
Order Confirmation Number	A unique identifier for this particular purchase transaction from this particular user. This number is generated and then stored as part of the permanent record of the purchase.
Order ID	A unique identifier assigned to a customer order.
Order Tracking	The process of tracing the status of particular order placed by a customer with an online store.
Order	A record of a request for goods or services by a customer.
Organic Results	Listings on SERPs that were not paid for; listings for which search engines do not sell space. Sites appear in organic (also called natural) results because a search engine has applied formulas (algorithms) to its search crawler index, combined with editorial decisions and content weighting, and deems them important enough for inclusion without payment. Paid Inclusion Content is also often considered organic even though it is paid advertising because paid inclusion content usually appears on SERPs mixed with unpaid, organic results.
Organic Search Listings	Listings that search engines do not sell (unlike paid listings). Instead, sites appear solely because a search engine has deemed it editorially important for them to be included, regardless of payment. Paid Inclusion Content is also often considered
Organic Search Rankings	Search engine ranking of web pages found in SERPs.
Organic Search	Users find results through unpaid search engines, unlike PPC.
Organic/Natural Listings	Listings that search engines do not sell (unlike paid listings). Instead, sites appear solely because a search engine has deemed it editorially important for them to be included, regardless of payment. Paid inclusion content is also often considered organic even though it is paid for. This is because that content usually appears intermixed with unpaid organic results.
Organization	The classification to which a Domain Name belongs. Typical suffixes are: .com = Commercial, .org = Organization, .edu = Educational, .int = International, .gov = Government, .mil = Military, .net = Network.
Original Referrer	Customers can visit your site multiple times, and have a different referrer for each visit. The original referrer is the referrer customers used the first time they arrived at your site.
OS	(Operating System) Software designed to control the hardware of a specific data-processing system in order to allow users and application programs to employ it easily. (MacOS, Windows 95)
Other methods	Other methods of data collection are sometimes used. Packet sniffing collects data by sniffing the network traffic passing between the web server and the outside world. Packet sniffing involves no changes to the web pages or web servers. Integrating web analytics into the web server software itself is also possible.[7] Both these methods claim to provide better real-time data than other methods.
Our solution	We helped this customer address the problem by building an application that would parse XML reports from Google Analytics. It would then automatically create an Excel sheet from the XML file.
Our Strengths	The Nabler team with their individual set of ambitions, ideas, vision and eccentricities is what we proudly claim as our strength.
Our Team	Nabler is headed by Seby Kallarakkal, an IIT Graduate who strongly believes in the power of the Internet to make businesses global and sustainable. We also have a small team of Web Analysts and Software developers who work together to realize the dreams and vision that Nabler stands for.
Our Values	Values define what an organization stands for and what it believes. At Nabler, good values are respected and nurtured. Every employee respects the values of the other and tries to learn from one another.
Our Vision	Our vision is to nurture an organization where ethics and values are held high, where the work culture is one that every employee would cherish, where promises are kept and relationships valued above everything.
Our Work Culture	Nabler's work culture can be aptly described in a word: unique! At Nabler, employees themselves choose a work discipline that helps them be efficient. Independence and flexibility are stressed. Creativity and originality of thought are most respected.
Outbound Links	Links on a particular web page leading to other web pages on a different domain.
Output Controls	Section of the application code that provides the user with a direct report on the status of an order.
Over Time Report	Used in the Purchases, Shopping Cart and Custom Events reporting sections. Similar to the Page Views report, each of these reports displays data for one Success Metric over a specific time period, such as a day, week, month, etc.
overlay	An Ajax-based screen that appears in an existing webpage.
P2P	Peer-To-Peer. A network scheme in which individual computers communicate directly with each other to share files.
P3P (Platform for Privacy Preferences Project)	Developed to provide a simple, automated way for individuals to gain more control over the use of personal information on Web sites they visit. P3P enhances user control by putting privacy policies where users can find them, in a form users can understand, and in a way in which they can act on what they see. (See http://www.w3.org/P3P for more information.)
P4P	Acronym for Pay for Performance, also designated as PFP. See also PPC Advertising.
Packet Switching	The method used to move data around on TCP/IP networks. Data coming out of a machine is broken up into chunks; each chunk includes the address of where it came from and where it is going. This enables chunks of data from many different sources to co-mingle on the same lines and be sorted and directed to different routes by special machines along the way. This way many people can use the same lines at the same time.
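The chunking described under Packet Switching can be illustrated with a small sketch. This is a toy model, not a real TCP/IP implementation: the function names, the dictionary layout, the addresses, and the `seq` offset used to put chunks back in order are all assumptions made for the example.

```python
from itertools import zip_longest

def packetize(src, dst, data, size=4):
    """Break data into chunks, each carrying its source and destination address."""
    return [
        {"src": src, "dst": dst, "seq": offset, "payload": data[offset:offset + size]}
        for offset in range(0, len(data), size)
    ]

def share_line(*streams):
    """Interleave packets from several senders onto one shared line."""
    line = []
    for group in zip_longest(*streams):
        line.extend(p for p in group if p is not None)
    return line

def reassemble(line, src, dst):
    """Pick out one sender's packets from the shared line and restore their order."""
    mine = [p for p in line if p["src"] == src and p["dst"] == dst]
    return "".join(p["payload"] for p in sorted(mine, key=lambda p: p["seq"]))

# Two hypothetical machines send to the same destination over one line.
a = packetize("10.0.0.1", "10.0.0.9", "hello from machine A")
b = packetize("10.0.0.2", "10.0.0.9", "and hello from B")
line = share_line(a, b)
print(reassemble(line, "10.0.0.1", "10.0.0.9"))
print(reassemble(line, "10.0.0.2", "10.0.0.9"))
```

Because every chunk carries its own addresses, the two messages can share the line and still be separated at the far end.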
Page Depth / Page Views per Session	Page Depth is the average number of page views a visitor consumes before ending their session. It is calculated by dividing total number of page views by total number of sessions and is also called Page Views per Session or PV/Session.
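As a minimal sketch of the Page Depth calculation, the division can be written as below. The function name and the sample figures are illustrative assumptions, not values from any analytics tool.

```python
def page_depth(total_page_views: int, total_sessions: int) -> float:
    """Average page views per session (PV/Session)."""
    if total_sessions <= 0:
        raise ValueError("total_sessions must be positive")
    return total_page_views / total_sessions

# Hypothetical totals for one reporting period.
print(page_depth(1200, 400))
```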
Page Duration	Time spent by visitor on a web page.
Page Impression	See Impression.
Page Naming Tool	The Page Naming Tool may be used to change page names as they appear in SiteCatalyst or, in other words, to give pages friendlier names. This tool allows you to change the displayed page name rather than the value of the pageName variable, which may improve the readability of your reports.
Page Naming	Strategy Omniture uses during SiteCatalyst implementation to determine how a web page name will be displayed in SiteCatalyst reports. For example, the page URL can display (www.mysite.com) or a friendly name can display (Electronics).
Page Not Found	The 404 or Not Found Error Message is an HTTP standard response code indicating that the client was able to communicate with the server, but the server either could not find the file that was requested, or was unwilling to fulfill the request for it and did not wish to reveal the reason why.
Page views per visit	Average number of pages viewed during a visit. Calculated by dividing the number of Page Views by the number of Visits.
Page Summary	SiteCatalyst report that collects and organizes page-specific information about a single page and presents it in a single report.
Page tagging	Concerns about the accuracy of logfile analysis in the presence of caching, and the desire to be able to perform web analytics as an outsourced service, led to the second data collection method, page tagging or Web bugs. In the mid 1990s, Web counters were commonly seen; these were images included in a web page that showed the number of times the image had been requested, which was an estimate of the number of visits to that page. In the late 1990s this concept evolved to include a small invisible image instead of a visible one, and, by using JavaScript, to pass along with the image request certain information about the page and the visitor. This information can then be processed remotely by a web analytics company, and extensive statistics generated. The web analytics service also manages the process of assigning a cookie to the user, which can uniquely identify them during their visit and in subsequent visits. Cookie acceptance rates vary significantly between web sites and may affect the quality of data collected and reported. Collecting web site data using a third-party data collection server (or even an in-house data collection server) requires an additional DNS look-up by the user's computer to determine the IP address of the collection server. On occasion, delays in completing successful or failed DNS look-ups may result in data not being collected. With the increasing popularity of Ajax-based solutions, an alternative to the use of an invisible image is to implement a call back to the server from the rendered page. In this case, when the page is rendered on the web browser, a piece of Ajax code would call back to the server and pass information about the client that can then be aggregated by a web analytics company. This is in some ways flawed by browser restrictions on the servers which can be contacted with XMLHttpRequest objects.
Also, this method can lead to slightly lower reported traffic levels, since the visitor may stop the page from loading in mid-response before the Ajax call is made.
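The invisible-image technique described under Page tagging amounts to requesting an image whose URL carries page and visitor details as query parameters. The sketch below shows both sides of that exchange in Python rather than in the browser JavaScript a real tag would use. The collector address and the parameter names (`url`, `title`, `vid`) are hypothetical and do not belong to any particular analytics product.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical third-party collection server.
COLLECTOR = "https://collector.example.com/pixel.gif"

def beacon_url(page_url, page_title, visitor_id):
    """Build the image request a page tag would issue for one page view."""
    params = {"url": page_url, "title": page_title, "vid": visitor_id}
    return COLLECTOR + "?" + urlencode(params)

def read_beacon(request_url):
    """Recover the page and visitor details on the collection server side."""
    query = urlsplit(request_url).query
    return {name: values[0] for name, values in parse_qs(query).items()}

request = beacon_url("http://www.example.com/a?b=1", "Home Page", "visitor-42")
print(request)
print(read_beacon(request))
```

In practice the `vid` value would come from the cookie that the analytics service assigns to the visitor.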
Page Tags	Tags are JavaScript codes embedded in the web page to be executed by the browser. Tags are used to generate log files used by certain Web Analytics Tools.
page title	The titles of your web pages as you have defined them. Yahoo! Web Analytics enables you to assign each web page a unique name for the purposes of reporting. If you do not define a page name, Yahoo! Web Analytics will use that page's HTML title tag in the reports.
page URL	The URLs of your web pages.
Page Value	SiteCatalyst report that shows you how much certain pages participated in generating revenue. For events, the Page Value Report uses allocation metrics.
Page View Duration / Time on Page	Average amount of time that visitors spend on each page of the site. As with Session Duration, this metric is complicated by the fact that analytics programs cannot measure the length of the final page view unless they record a page close event.
page view	An instance of a visitor loading a particular web page from your site. For example, if someone visits your homepage, we count a page view for that page. Compare to click, visit, and unique visitor.
Page Viewed by Key Visitors	SiteCatalyst report that lists all of the pages of your web site that have been visited by the key visitors you have specified. Key visitors are defined by entering the domain or IP address of the groups you would like to track. You may enter up to five key visitor groups.
page views %	Percentage of page views.
page views per unique visitor	Average number of pages viewed by a daily unique visitor. Calculated by dividing the number of page views by the number of daily unique visitors.
Page Views	Used to convey relative popularity of pages within your site. Number of pages successfully loaded from your site for visitors. This excludes error pages and views by search engine robots/spiders.
Page	A single file on an organization's web server, or any iteration of how the file is presented on the organization's web site.
PageMaps	Structured data format Google created to enable website creators to embed data and notes in their webpages. Unlike other structured data formats, PageMaps does not require you to follow standard properties or terms, or even refer to an existing vocabulary, schema, or template. You can just create custom attribute values that make sense for your website. To learn more, see Providing Structured Data.
pageName Variable	The pageName variable is used to identify each page that will be tracked on the site. If the pageName variable is not populated with a defined value like Home, then SiteCatalyst will record the URL as the page name.
PageRank (PR)	PR is the Google technology developed at Stanford University for placing importance on pages and web sites. At one point, PageRank (PR) was a major factor in rankings. Today it is one of hundreds of factors in the algorithm that determines a page's rankings.
Pages per Visit	This tells you the average number of pages that get viewed during each visit. Higher numbers indicate that your visitors read multiple pages before they leave.
pageType Variable	The pageType Variable is used only to designate a 404 Page Not Found Error Page. It only has one possible value, which is errorPage.
pageURL Variable	The pageURL variable overrides the actual URL of the page in cases where the URL of the page is not the URL that you would like reported in SiteCatalyst.
Pageview	A page is defined as any file or content delivered by a web server that would generally be considered a web document. This includes HTML pages (.html, .htm, .shtml), script-generated pages (.cgi, .asp, .cfm, etc.), and plain-text pages. It also includes sound files (.wav, .aiff, etc.), video files (.mov, etc.), and other non-document files. Only image files (.jpeg, .gif, .png), javascript (.js) and style sheets (.css) are excluded from this definition. Each time a file defined as a page is served, a pageview is registered by Google Analytics.
Paid Inclusion	Refers to the process of paying a fee to a search engine in order to be included in that search engine or directory. Also known as guaranteed inclusion. Paid inclusion does not impact rankings of a web page; it merely guarantees that the web page itself will be included in the index. These programs were typically used by web sites that were not being fully crawled or were incapable of being crawled, due to dynamic URL structures, frames, etc.
Paid Search Detection	SiteCatalyst utility that configures rules that SiteCatalyst can use to determine if specific search engines were paid (your company paid a fee for the search engine to list your site), and allows SiteCatalyst to identify the keywords that were used on a paid search engine. The marker is determined by the client; it must be in the query string, and is case-sensitive.
Paid Search Keywords	Keywords that return high visibility search results in a search engine. The results are displayed at the top of the results or in a special report for paid keywords.
Parameters	These are located in the URL immediately after a question mark and followed by an equal sign and a return value, known as name=value.
Partial Shipment	A process in which a store ships / provides only some of the goods / services in a single order. Therefore, the PostAuth amount would be less than the amount of the approved PreAuth for the order.
Participation	Participation metrics assign full revenue credit to each page used to generate the revenue. For example, if a visitor navigates through five pages of your site before making a $10,000 purchase, the participation metric gives $10,000 credit to each page used in the purchase process. If any events have participation enabled, then the pages participating in the event also have participation enabled.
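The participation rule can be sketched in a few lines. This is an illustration only: the page names are made up, and crediting a page once per visit even if it was viewed several times is an assumption, not documented SiteCatalyst behavior.

```python
from collections import defaultdict

def participation_credit(visits):
    """Give every page in a visit full credit for that visit's revenue.

    `visits` is a list of (pages_viewed, revenue) pairs.
    """
    credit = defaultdict(float)
    for pages, revenue in visits:
        for page in set(pages):  # assumption: credit each page once per visit
            credit[page] += revenue
    return dict(credit)

# One visit through five hypothetical pages ending in a $10,000 purchase.
visits = [(["home", "search", "product", "cart", "checkout"], 10000.0)]
print(participation_credit(visits))
```

Note that the credited amounts add up to more than the actual revenue, which is expected for a metric that gives full credit to every participating page.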
Password Maintenance	Procedures and processes used to establish and maintain the password portion of the authentication service that allows access to application systems.
Password	A password is the word or code used to authenticate a user on the Google Analytics administration or reporting system, or any other protected system. It is advisable to use passwords that are difficult to guess, such as those containing numbers or symbols.
Path Analysis	Analysis of how visitors traverse the website. It gives valuable information, for example whether visitors follow the intended site navigation.
Path	A Path is defined as a series of clicks resulting in distinct pageviews. A Path cannot contain non-pages, such as image files. Each step in a path will have a name, such as index.html.
Pathfinder Wizard	The wizard used to generate the Pathfinder Report in SiteCatalyst.
Pattern Builder Canvas	Second part of the Pathfinder Wizard in the Pathfinder Report that enables you to drag-and-drop desired items for display in the report.
Pattern Builder	The first step in the Pathfinder Wizard, which enables you to select a pattern type that will display a template to specify a type of path.
Pay Per Call	A model of paid advertising similar to Pay Per Click (PPC), except advertisers pay for every phone call that comes to them from a search ad, rather than for every click-through to their web site landing page for the ad. Often higher cost than PPC advertising; but valued by advertisers for higher conversion rates from consumers who take the action step of telephoning an advertiser.
pay per click (also ‘cost per click’)	An advertising model in which the advertiser pays a certain amount to the publisher each time the advertiser's ad is clicked from the publisher's site.
Pay-per-click (PPC)	Outsource2india's PPC services can help you have a good online paid advertising strategy. We identify the most relevant keywords that you need to pitch for and the most profitable online platforms to advertise on. The PPC model has worked really well for companies, as they pay for the ads posted on major search engines only when their ads are clicked by visitors. Apart from paid search marketing services, Outsource2india also provides affiliate marketing services.
Pay-per-click	An advertising model in which the sponsor (advertiser) pays a certain amount to the publisher each time the sponsor's ad is clicked. Also referred to as cost-per-click.
Payment Gateway	Computer system that acts as a mediator between a merchant account and an online shop. The payment gateway authenticates payment card information and manages real-time charging from a payment card.
PCRE	Initialism for Perl-compatible regular expressions. See regular expressions.
PDA	Personal Digital Assistant. A kind of electronic Filofax with email capability.
PDF	Portable Document Format. File format developed by Adobe Systems to allow for display and printing of formatted documents across platforms and systems. PDF files can be read on any system equipped with the Acrobat Reader software, regardless of whether or not your computer has the software that the document was created in.
PEF	Personal Experience Factor is the customer's interaction with your website, advertising, or brand.
Performance Indicators	See KPIs.
Performance-based Advertising	PPC and Affiliate Marketing are examples of performance-based advertising. In 2008 this accounted for 50% of all online ad spend. There are advantages for advertisers and publishers with performance-based advertising, but publishers need to change their approach and train their sales staff. See: The Rise of Performance-based Advertising.
Perl-compatible regular expressions.	See regular expressions.
Persistent Cookies	A cookie (text file) that stays on a visitor's computer between visits so that Omniture can identify the visitor in subsequent visits.
Personas	These are people types or sub-groups that encompass several attributes, such as gender, age, location, salary level, leisure activities, lifestyle characteristics, marital/family status or some kind of definable behavior. Useful profiles for focusing ad messages and offers to targeted segments.
PFP	Acronym for Pay for Performance; also designated as P4P. See also PPC Advertising.
PIE	Persistent Identification Element is a type of tag that is attached to a user's browser, providing a unique ID similar to traditional cookie coding.
PIN	A unique number used to authorise a bank or credit card.
PKI	Public Key Infrastructure. A mechanism for secure communications over a network or the internet.
Platform	A platform is a specific computer hardware and software operating system combination that represents a specific user's configuration and method of accessing the Internet. Common platforms include Windows NT/x86 (Microsoft Windows NT on a standard Intel-type PC), Mac PPC (Macintosh with Power PC processor), Red Hat Linux 6.1 x86 (Linux on a standard Intel-type PC).
Plug-In	A plug-in is a piece of software that enhances the capabilities of an application.
PMML (Predictive Modeling Markup Language)	An effort by the Data Mining Group (DMG) to make predictive models interchangeable. PMML is based on XML. It supports a number of different statistical predictive model types.
POA	Point of Action is the location of a conversion event.
POC	Percentage of Completion or Proof of Concept
Podcasts	A podcast is a media file that is distributed over the internet using syndication feeds, for playback on portable media players and personal computers. Like radio, it can mean both the content and the method of syndication. The latter may also be termed podcasting. The host or author of a podcast is often called a podcaster. Source: Wikipedia
Point-in-time services	Point-in-time services occur when customers are directed to your website to perform a well-defined task at a specific, often event-driven, time. Examples include payment for utility bills, product registrations, technical support, and customer satisfaction surveys.
Point-Of-Sale	A facility, such as a web form, through which an online merchant accepts new customer orders.
POP3	Post Office Protocol 3. A protocol used by email systems to retrieve messages from a mail server.
Popup Blocker Detection Plug-in	The Popup Blocker Detection Plug-in determines the number of visits where a popup blocker is enabled or disabled.
POS	Point of sale, the physical or virtual checkout.
Position Preference	A feature in Google AdWords and in Microsoft adCenter enabling advertisers to specify in which positions they would like their ads to appear on the SERP. However, it does not provide a position guarantee.
Position	In PPC advertising, position is the placement on a search engine results page where your ad appears relative to other paid ads and to organic search results. Top ranking paid ads (high ranking 10 to 15 results, depending on the engine) usually appear at the top of the SERP and on the right rail (right-side column of the page). Ads appearing in the top three paid-ad or Sponsored Ad slots are known as Premium Positions. Paid search ad position is determined by confidential algorithms and Quality Score measures specific to each search engine. However, factors in the engines' position placement under some advertiser control include bid price, the ad's CTR, relevancy of your ad to searcher requests, relevance of your click-through landing page to the search request, and quality measures search engines calculate to ensure quality user experience.
Possible “Failed” codes are	400 = Failed: Bad Request; 401 = Failed: Unauthorized; 402 = Failed: Payment Required; 403 = Failed: Forbidden; 404 = Failed: Not Found; 500 = Failed: Internal Error; 501 = Failed: Not Implemented; 502 = Failed: Overloaded Temporarily; 503 = Failed: Gateway Timeout
Possible “Success” codes are	200 = Success: OK; 201 = Success: Created; 202 = Success: Accepted; 203 = Success: Partial Information; 204 = Success: No Response; 300 = Success: Redirected; 301 = Success: Moved; 302 = Success: Found; 303 = Success: New Method; 304 = Success: Not Modified
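The two code lists above group status codes by their first digit. A log-processing script might apply that grouping as sketched below. Extending it to every code in the 2xx to 5xx ranges, rather than only the codes listed, is an assumption made for brevity, and the function name is illustrative.

```python
def classify_status(code: int) -> str:
    """Label a status code the way the two glossary lists do."""
    if 200 <= code < 400:   # the glossary counts 2xx and 3xx as "Success"
        return "Success"
    if 400 <= code < 600:   # 4xx and 5xx are counted as "Failed"
        return "Failed"
    return "Unknown"

for code in (200, 304, 404, 503):
    print(code, classify_status(code))
```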
Post	There are two methods to send HTML form data to a server. GET, the default, will send the form input in a URL, whereas POST sends it in the body of the submission. The latter method means you can send larger amounts of data, and that the URL of the form results doesn't show the encoded form.
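The difference between the two methods can be seen by building, without sending, the same form submission both ways with Python's standard library. The target URL and the form fields are placeholders chosen for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder form input.
form = {"q": "web analytics", "page": "2"}
encoded = urlencode(form)

# GET: the encoded form travels in the URL itself.
get_req = Request("http://www.example.com/search?" + encoded)

# POST: the URL stays clean and the encoded form goes in the request body.
post_req = Request("http://www.example.com/search", data=encoded.encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Neither request is sent here; the objects are only inspected to show where the form data ends up.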
PostAuth	A transaction that converts a PreAuth transaction into a Captured state for settlement. In the case of partial shipments, the PostAuth amount may be less than the PreAuth amount. PostAuths are usually initiated after purchased goods have been shipped. See also Auth and PreAuth.
Postmaster	The person at an ISP, or company, in charge of email services.
Powerful Dynamic Filters	Sawmill allows dynamic segmentation of reports through its advanced filtering capabilities. Simple one-click zoom filters provide easy zooming into any item appearing in a report. For more advanced dynamic filtering, Sawmill provides advanced Boolean (AND/OR/NOT) selection based on multiple criteria, including wildcards and regular expressions. Log data can also be filtered on import using Log Filters, which use Salang, Sawmill's built-in scripting language, for extremely flexible filtering and conversion.
PPC Advertising	Acronym for Pay-Per-Click Advertising, a model of online advertising in which advertisers pay only for each click on their ads that directs searchers to a specified landing page on the advertiser's web site. PPC ads may get thousands of impressions (views or serves of the ad); but, unlike more traditional ad models billed on a CPM (Cost-Per-Thousand-Impressions) basis, PPC advertisers only pay when their ad is clicked on. Charges per ad click-through are based on advertiser bids in hybrid ad space auctions and are influenced by competitor bids, competition for keywords and search engines' proprietary quality measures of advertiser ad and landing page content.
PPC Management	The monitoring and maintenance of a Pay-Per-Click campaign or campaigns. This includes changing bid prices, expanding and refining keyword lists, editing ad copy, testing campaign components for cost effectiveness and successful conversions, and reviewing performance reports for reports to management and clients, as well as results to feed into future PPC campaign operations.
PPC	Acronym for Pay Per Click. See also PPC Advertising.
PPCSE	Acronym for Pay-Per-Click Search Engine.
PPP	Pay-Per-Performance. A method of remuneration in online advertising.
PreAuth	A transaction type in which a cardholder's account is verified to be in good standing, that is, the card is valid, is within its limit, and any applicable Address Verification Service checks have been performed and approved. If the verifications are approved, the total amount of the order is reserved against the cardholder's account balance. PreAuths are used if goods are to be physically shipped or in other cases for which the merchant must first verify whether the order can be fulfilled. An approved PreAuth is followed by a PostAuth, which prepares it for settlement. See also Auth and PostAuth.
Precede Pattern	Pattern in the Pattern Builder that enables you to analyze the page(s) preceding a selected page
presentation layer	A set of code (such as JavaScript, PHP, JSP and ASP) that transforms the raw data into a format that is displayed to the user. In the case of Custom Search, you transform XML data into an HTML file that is presented to the end user.
Previous Page Flow	SiteCatalyst report that graphically illustrates two levels of the most popular pages that your visitors view prior to the selected page. The report also highlights when visitors enter your site.
Previous Page	SiteCatalyst report that provides detailed site path analysis by showing you where visitors to each page in your site came from. For example if you have a features page, this report will show you the top five pages your visitors came from to get to your features page.
Primary Server Calls	In multi-suite tagging, it is the first call to the Omniture servers. Any subsequent calls to the Omniture servers are charged at a price determined in the client's contract. For example, if s_account=abc,123 the call to abc is the primary server call.
Prior Unique Visitor	A Prior Unique Visitor is defined as a unique visitor to the website that returned during the specified Date Range after previously visiting your site, as identified by tracking devices such as cookies.
Privacy Policy	Omniture's official statement on the type of information collected on a site, how the information will be used, how the person can access this data and the steps for having the data removed. A privacy statement will also usually include information regarding systems that are in place to protect the information of web site visitors.
private Page Map	A PageMap with data protected by an AccessKey. Private PageMap data cannot be returned in XML unless the AccessKey is specified by the pgmpk parameter of the search URL. To learn more, see Providing Structured Data.
promotion	A specially created result that appears at the top of the results page. It associates a created result with a pre-defined set of query terms.
Processes Almost Any Log File	Sawmill can process the text log files generated by all popular devices and servers, in over 700 formats. If you want to analyze a log in a different format, Sawmill also lets you specify a custom log format. If your log is generated by publicly-available software, we can do this for you: just email a sample of your log file to sawmill@flowerfire.com, and we can write you a log format file that you can plug right in to your copy of Sawmill.
Processing Controls And Edits	Processing controls and edits are sections of application or operating system code that focus on ensuring the integrity of the interaction with the user. An example of this type of code would be code that would ensure that all required changes to a set of databases were made as part of a transaction.
product revenue	Total value of the product revenue.
product units	Number of product units purchased by your visitors.
product views	Number of times your products were viewed.
Product	One of several values contained in the products variable; i.e., products=category; product; quantity; price;
products Variable	The products variable is used for tracking multiple products and product categories as well as purchase quantity and purchase price, and event serialization and merchandising.
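The Product entry gives the field order inside the products variable as category; product; quantity; price. A parser for that layout might look like the sketch below. The use of a comma to separate multiple products, the sample values, and the function name are assumptions for illustration and should be checked against the SiteCatalyst documentation.

```python
def parse_products(value):
    """Split a products string into one record per product.

    Assumed layout: fields separated by ';' in the order
    category; product; quantity; price, with ',' between products.
    """
    items = []
    for entry in value.split(","):
        fields = [field.strip() for field in entry.split(";")]
        fields += [""] * (4 - len(fields))   # pad entries with missing fields
        category, product, quantity, price = fields[:4]
        if not product:
            continue                          # skip empty entries
        items.append({
            "category": category,
            "product": product,
            "quantity": int(quantity) if quantity else 0,
            "price": float(price) if price else 0.0,
        })
    return items

# Hypothetical value with two products.
print(parse_products("Electronics;Camera;2;199.98,Books;Novel;1;9.99"))
```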
Professional Services	See Client Services.
Profile	A Profile is a set of rules governing the production of a set of Google Analytics reports from log file data. Generally, there will be one Profile per domain/URL (e.g., www.googleanalytics.com). However, there can be any number of Profiles for any one source, as each may have different rules for exclusion or inclusion of certain log data elements.
project	Yahoo! Web Analytics uses the term project to denote a tracked web property (i.e., a web site). A project is defined by the unique tracking code that you embed in your web pages. The reporting interface groups all pages tracked with that code into one set of statistical reports. You may also track pages from multiple domains within the same project, or even divide one website into multiple projects.
Promotion	A message issued on behalf of a product, cause, idea, person, or institution.
Property	Sub-groups or subsections of channels. For example, assume a Web site has two channels: news and weather. The news channel may then have the following four sub-groups, or Properties: national, local, sports and politics.
propN Variable	Property (prop) variables are used for building custom reports within SiteCatalyst's Traffic Module for pathing reports, or in correlation reports.
Protocol	A set of rules governing the transmission and reception of data.
Psychographics	Data used to build customer segments based on attitudes, values, beliefs and opinions as opposed to the factual characteristics. See also demographics.
Public Domain	Signifies that copyright holders have either donated their products to the public, or copyright has expired. If something is in the public domain, you can use it without infringing any rights.
Purchase Event	A success event in which a visitor to your site purchases a product.
Quality Index	For details, please see Quality Score
Quality Score	A number assigned by Google to paid ads in a hybrid auction that, together with maximum CPC, determines each ad's rank and SERP position. Quality Scores reflect an ad's historical CTR, keyword relevance, landing page relevance, and other factors proprietary to Google. Yahoo! refers to the Quality Score as a Quality Index. Both Google and Yahoo! display 3- or 5-step indicators of quality evaluations for individual advertisers.
Query Parameter	Any VARIABLE=VALUE pair that follows the question mark (?) in a URL. Google Analytics receives campaign information from query parameters appended to destination URLs. For example, the URL http://www.example.com/search?q=foo contains the query parameter q=foo.
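Extracting the VARIABLE=VALUE pairs from a URL can be sketched with Python's standard library. The helper name is illustrative; the two URLs are the examples used in the Query Parameter and Query String Parameter entries.

```python
from urllib.parse import urlsplit, parse_qsl

def query_parameters(url):
    """Return the VARIABLE=VALUE pairs that follow the question mark."""
    return dict(parse_qsl(urlsplit(url).query))

print(query_parameters("http://www.example.com/search?q=foo"))
print(query_parameters("http://www.mysite.com?cid=12345"))
```

Using a dictionary keeps only the last value when a variable is repeated; a tool that needs every occurrence could keep the list returned by `parse_qsl` instead.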
Query String Parameter	Alphanumeric value that uniquely identifies each element and its exact placement on the Internet that ultimately brings an end user to your website. The query string parameter is usually offset by a question mark (?) in the URL. For example, the URL http://www.mysite.com?cid=12345 contains a query string parameter of 12345. Also see Query String.
query string	The part of a URL that is displayed after the question mark. It is composed of a series of field-value pairs and contains data to be passed to web applications.
Query String	The text data following the ? in a URL. It is the part of a URL that conveys parametric data to the server.
query term	The search term that would trigger a promotion.
Query Token	A query token is a special character in a URL that differentiates the main URL from the specific query. For example, in this URL: http://www.google.com/search?q=analytics the query token is the question mark.
Query	The keyword or keyword phrase a searcher enters into a search field, which initiates a search and results in a SERP with organic and paid listings.
R&D DIVISION for customized solutions	Our R&D division continuously churns out innovative and customized products for specific analytics requirements of global businesses. Learn more about our unique tools: 6Science and MapMyLead.
Rank	How well positioned a particular web page or web site appears in search engine results. For example, if you rank at position #1, you're the first listed paid or sponsored ad. If you're in position #18, it is likely that your ad appears on the second or third page of search results, after 17 competitor paid ads and organic listings. Rank and position affect your click-through rates and, ultimately, conversion rates for your landing pages.
RDFa	Resource Description Framework in attributes is a standard W3C specification with attributes that extends XHTML to include semantic metadata. See also structured data format.
Reach	Reach can be defined as the probability of gaining the attention of your prospective visitors
Rear-View Mirror Metrics	Metrics that measure what has occurred. For example, campaign response metrics tell you how a campaign performed.
Recency	The number of days since a visitor's most recent visit during a reporting period. See also frequency.
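Recency can be computed from a visitor's visit dates. In the sketch below the days are counted back from the end of the reporting period; that reference point, the function name, and the sample dates are assumptions, since the definition does not say which day the count starts from.

```python
from datetime import date

def recency_days(visit_dates, period_end):
    """Days between the most recent visit and the end of the reporting period."""
    eligible = [d for d in visit_dates if d <= period_end]
    if not eligible:
        return None  # the visitor has no visit on or before period_end
    return (period_end - max(eligible)).days

# Hypothetical visitor with two visits in March 2016.
visits_in_march = [date(2016, 3, 1), date(2016, 3, 20)]
print(recency_days(visits_in_march, date(2016, 3, 31)))
```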
Reciprocal Link	Two different sites that link out to each other. Also referred to as Cross Linking.
redirect	URL redirection. A method for making a web page available under another URL.
Referral Errors	A referral error occurs whenever someone clicks on a link that points to your site but that contains a reference to a non-existent page or file. This action usually results in a 404 Not Found-type error.
Referral site	Used to determine the source of your web traffic; defined as the site URL or title where your visitors came from.
Referrals	A referral occurs when any hyperlink is clicked on that takes a web surfer to any page or file in another website; it could be text, an image, or any other type of link. When a web surfer arrives at your site from another site, the server records the referral information in the hit log for every file requested by that surfer. If a search engine was used to obtain the link, the search engine name and any keywords used are recorded as well.
referrer Variable	The referrer variable may be used to restore lost referrer information.
Referrer	The URL of an HTML page that refers visitors to a site.
referring domain	Domains that refer visitors to your website.
Referring Sites	These visits came to you by clicking a link on another site.
referring URL	The URLs of the domains that referred visitors to your website. Also called referrer.
refinement label	A label that appears as links at the top of the search results page. The label is applied to the search engine only when users click the refinement link. For contrast, see search engine label. To learn more, see Helping Your Users Refine Their Searches.
refinement link	A link that appears at the top of the search results page. Users can click a refinement link to narrow their searches. To learn more, see Helping Your Users Refine Their Searches.
refinement	A way to categorize sites by topics. You can create refinement labels that you associate with sites; refinement links then appear at the top of your search results page, and users can click them to narrow down their searches. To learn more, see Helping Your Users Refine Their Searches.
region / state code	Your visitors' local regions.
Regular Expressions	Regular Expressions are tools defined by the POSIX specification used to match text strings based on rules invoked by special characters, such as asterisks (*). Regular Expressions are powerful tools and should be fully understood before use. For more information, please see the IEEE site.
Relationship Marketing	Relationship marketing is a type of marketing that traces its roots to direct response marketing. It emphasizes building long-term relationships with customers rather than individual transactions. It requires understanding customer needs as they go through life cycles of interacting and purchasing from organizations, and requires that marketers accurately determine customer intent in order to provide them the right message at the right time.
Relevance	A concept in PPC advertising. Relevance is a measure of how closely your ad title, description, and keywords are related to the search query and the searcher's expectations.
Repeat Session	A session for which the visitor could be tracked as unique and as having been to the site before this session during the current Date Range.
Repeat Visitor	A visitor who has visited your website before. A higher percentage of returning users is good news for your website since it shows that they are interested in your products, services or content.
Report Accelerator	The Report Accelerator caches data for up to 30 minutes in order to speed up report generation times.
report cache	The report cache lets you store the most recently accessed reports and KPIs and serves them from the cache to all account users. This caching results in considerably faster report loading times. Report caching can be configured at the project level.
Report Suite	The report suite is the most fundamental level of segmentation in SiteCatalyst reporting. You can set as many report suites as your contract allows. Each report suite refers to a dedicated set of tables that are populated in Omniture's collection servers. A report suite is identified by the s_account variable in your JavaScript code. By defining a business segment as a report suite, you can view SiteCatalyst's full set of reports within the report suite.
Report	A report set is a distinct Google Analytics report about one particular web site, part of a web site, or content group. A report set will have all of Google Analytics reporting features dedicated to the analysis of itself only. Generally, one report set is defined for each web site, though more than one can be configured.
Report-specific Success Metrics	Refers to elements that apply only to the report you are viewing. They can also be described as happens events, such as the number of times a product is viewed (Product Views) or a campaign is clicked (Click-throughs). These metrics appear at the bottom of the Item Summary report and are also displayed in the Conversion Base reports when selected from the Success Metrics menu.
Reportlet - Canvas Builder	Wizard in SiteCatalyst that lets you generate any SiteCatalyst report without having to navigate through the list of available reports.
Request URI	A request URI identifies a page or a set of pages on your website by the path and/or query parameters. For instance, if www.example.com is developed in static HTML and it has a page on that site called about.html located in a sub-directory on the site, the request URI for that page might be /company/information/about.html. On the other hand, if example.com is developed in php, the request URI for that page might look like /pages.php?topNav=Company&sideNav=about&page=about.php. You can use this field to refine report data by a known request URI for a single page or a set of pages. Results are returned for all pages that match your indicated string. For example, you could enter /company/about/ to include or exclude all pages in the about directory of the company website.
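As a rough illustration of the Request URI entry above, the following Python sketch extracts the path and query portion from a full URL and applies a simple substring filter. The helper names `request_uri` and `matches` are hypothetical, and substring matching is only an assumption about how a reporting tool might apply the filter string.

```python
from urllib.parse import urlsplit


def request_uri(url: str) -> str:
    """Return the path plus any query string of a URL (the request URI)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return f"{path}?{parts.query}" if parts.query else path


def matches(uri: str, fragment: str) -> bool:
    """True when the request URI contains the filter string (assumed rule)."""
    return fragment in uri


# The two example pages from the glossary entry.
static_page = request_uri("http://www.example.com/company/information/about.html")
dynamic_page = request_uri(
    "http://example.com/pages.php?topNav=Company&sideNav=about&page=about.php"
)
print(static_page)
print(dynamic_page)
```

A filter such as `/company/about/` would then be tested against each page's request URI with `matches`.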
Resources
Retention	This refers to your customer buying from you again. Campaign analysis consists of tracking, measuring and optimizing website visitor traffic and reach and acquisition initiatives. Campaign analytics is an integral part of the overall web analytics process because a major part of the online budget is invested in getting visitor traffic to the website. At Outsource2india, we can help you optimize your marketing campaign spend by tracking and measuring the effectiveness of the campaigns at the broad level to the key phrase or creative level. Web visitor traffic acquisition types can be categorized into offline and online channels. Online channels: PPC, banners, PR/social media initiatives, emails and affiliate programs. Offline channels: television, radio, print, POS advertisements, direct mail, OOH media, etc.
Return Code	The return status of the request which specifies whether the transfer was successful and why.
Return Frequency	SiteCatalyst report that shows the number of days between repeat visits from your visitors.
Return on Investment (ROI)	(Revenue - Cost) / Cost, expressed as a percentage. For example, if an investment of $150 was made for advertising, and led to $500 in sales, the ROI would be: ($500 - $150) / $150 = 2.33 or 233%.
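The ROI formula above can be checked with a few lines of Python. This is a minimal sketch; the function name `roi` is only illustrative.

```python
def roi(revenue: float, cost: float) -> float:
    """Return on investment as a ratio: (revenue - cost) / cost."""
    if cost == 0:
        raise ValueError("cost must be non-zero")
    return (revenue - cost) / cost


# The glossary example: $150 spent on advertising, $500 in sales.
example = roi(500, 150)
print(f"{example:.2f} or {example:.0%}")  # prints "2.33 or 233%"
```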
Return Visitor	A visitor who can be identified with multiple visits, either through cookies or authentication.
returning (returning visits, returning campaign visits)	Number of returning visits to your website from visitors who first visited the site through a campaign and then returned, regardless of whether the return visit was campaign or non-campaign generated. Any returning visits will be counted for the campaign that generated the visitor's first visit.
Returning Sessions	Returning Sessions represents the number of times unique visitors returned to your website during a specified time period.
Revenue	In versions of Google Analytics that support e-commerce reporting, the term Revenue is used in place of whichever local currency is being used, since Google Analytics supports currencies other than the US dollar. Revenue tabs appear on several reports as a data display option when appropriate.
Reverse DNS	Name resolution software that looks up an IP address to obtain a domain name. It performs the opposite function of the DNS server, which turns names into IP addresses.
Revshare / RevenueSharing	A method of allocating per-click revenue to a site publisher, and click-through charges to a search engine that distributes paid-ads to its context network partners, for every page viewer who clicks on the content site's sponsored ads. A type of site finder's fee.
RF-ID tags	Radio Frequency Identity Tags. Smart tags on stock items that can be read by handheld devices for monitoring stock movements.
RFM Analysis	Recency, Frequency, Monetary analysis.
RichMedia	Media with embedded motion or interactivity. A growing option for PPC advertisers as rates of broadband connectivity increase.
Right Of Withdrawal	The period within which a consumer can legally withdraw, or change their mind, about a purchase made or contract entered into. A new European directive defines this period as seven days.
RightNow Technologies	Organization with whom Omniture partners in order to help track customer service.
RightRail	The common name for the right-side column of a web page. On a SERP, right rail is usually where sponsored listings appear.
ROAS (Return on ad spend)	A determination of the effectiveness of your ad spending. (Sales / ad costs)
ROAS	Return on Advertising Spending
Robots.txt	A text file present in the root directory of a website which is used to direct the activity of search engine crawlers. This file is typically used to tell a crawler which portions of the site should be crawled and which should not be.
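To make the Robots.txt entry concrete, here is a small sketch using Python's standard `urllib.robotparser` module to check what a crawler may fetch. The rules, the `ExampleBot` user agent, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every crawler may fetch anything except /private/.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse the rules directly instead of fetching them

can_public = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
can_private = parser.can_fetch(
    "ExampleBot", "http://www.example.com/private/report.html"
)
print(can_public, can_private)
```

Note that robots.txt is advisory: a well-behaved crawler consults it, but nothing in the file itself enforces the rules.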
ROI	(Return on Investment) (Revenue - Cost) / Cost, expressed as a percentage.
Role-based Authentication	Sawmill supports role-based authentication, allowing you to control in detail what your Sawmill users have permission to do. You can create roles with varying degrees of permissions (e.g., permissions only to view reports for specific profiles; or permissions to edit but not delete Log Filters), and assign each user to one or more roles.
ROMI	Return on Marketing Investment
Rotating Log Files	Rotating log files is the process of ceasing to write to a particular log, renaming the file (usually by appending the date), possibly moving it to another directory, and then instructing the web server software to open a new log file for writing. The primary reason to do this is to keep the size of log files in check and to ensure that Urchin has processed all available data. To actually complete the log rotation, some web servers, like IIS, must be restarted (stopped/started).
RSA Encryption	A public-key encryption technology developed by RSA Data Security, Inc, based on the fact that there is no efficient way to factor very large numbers. Deducing an RSA key, therefore, requires an extraordinary amount of computer processing power and time. The RSA algorithm has become the de facto standard for industrial-strength encryption, especially for data sent over the internet. It is built into many software products, including web browsers.
RSS Aggregator	Client software that uses a web feed to retrieve syndicated web content such as blogs, podcasts, vlogs, and mainstream mass media websites, or in the case of a search aggregator, a customized set of search results. Such applications are also referred to as RSS readers, feed readers, feed aggregators, news readers or search aggregators. These have recently been supplemented by so-called RSS narrators (such as TalkingNews or Talkr), which not only aggregate news feeds but also convert them into podcasts.
RSS	Really Simple Syndication is a type of web syndication used by news sites and weblogs which provides summaries of information with links to the complete content. (Note: Webtrends offers an RSS feed for its Resource Center.)
s_account Variable	The s_account variable determines the report suite where data will be stored and reported in SiteCatalyst. If sending to multiple report suites (multi-suite tagging), s_account may be a comma-separated list of values.
Sampling	In statistics, the selection of individual observations intended to yield knowledge about a population, especially for the purposes of statistical inference.
San Jose Data Center	One of Omniture's centers in San Jose, California, where web analysis data is collected and stored.
Sandwich Pattern	Pattern in the Pattern Builder that enables you to analyze the page(s) between two pages.
Scalability	The ability of a computer application or product or indeed a website to continue to function well (or even better) as it is altered in size to meet customers' needs.
Scalable	Quality of an implementation that allows it to grow as the usage of the service increases.
Scenario Analysis	A report showing activity at each step of a pre-defined scenario.
Scheduled Report	A scheduled report is a SiteCatalyst report that is sent to you electronically (usually via email or FTP) based on metrics that you can select through a wizard in the SiteCatalyst interface.
score	Determines how intensely a label should be applied to a site. The score, which is applied to an individual annotation, tempers or reverses the influence of the weighted labels. It adds another layer of granularity to the fine-tuning of the ranking. To learn more, see Changing the Ranking of Your Search Results.
Screen Resolution	The size of the monitor and/or screen display settings.
Script	See JavaScript.
Search Analytics	Analyzing search terms and behavior of visitors using the website search engine.
Search element	An object that you can embed in your webpage. It shows the search box and search results together in the same webpage that the reader is viewing. When your users do searches, they are not taken to another webpage, unless they click links in the results section. If your users remain in the same page, they can close the results section and resume reading the webpage. See also Web Elements.
search engine label	A label that Custom Search uses to associate context specifications with annotations. The context file includes labels that identify the search engine, and you tag each site in the annotations file with the search engine labels. Custom Search displays search results according to how sites have been annotated with these labels. Search engine labels are also called background labels because, in contrast to refinement labels, which are visible to users and appear in the results page, search engine labels run in the background and are invisible to users. To learn more about search engine labels, see Changing the Ranking of Your Search Results.
Search Engine Marketing (SEM)	The act of marketing a web site via search engines, whether this be improving rank in organic listings (search engine optimization), purchasing paid listings (PPC management) or a combination of these and other search engine-related activities (e.g., affiliate programs, shopping feeds or link development).
Search Engine Marketing Public Relations (SEM PR)	The art of leveraging traditional PR materials to increase visibility and traffic via a hybrid of interactive PR strategies & tactics, including SEO, PPC and SMO. Tactics may include press release optimization and distribution, article syndication and social media outreach.
Search Engine Optimization (SEO)	The act of altering a web site so that it does well in the organic, crawler-based listings of search engines. In the past, has also been used as a term for any type of search engine marketing activity, though now the term search engine marketing is more commonly used as an umbrella term.
Search Engine Positioning (SEP)	Synonymous with SEO, search engine positioning is the act of altering a web site to perform well in organic or natural search results.
Search Engine Results Page (SERP)	A page of results generated by search engines based on weighted elements in each engine's algorithm. Each page typically consists of 10 URLs, with no more than 2 URLs per domain.
search engine specifications	See also context file.
Search Engine Submission	The act of submitting specific URLs to popular search engines like Google, MSN and Yahoo! to ensure the web page gets spidered and indexed.
Search Engine	A Search Engine is a program that searches documents for specified keywords and returns a list of the documents in which those keywords were found, often ranked according to relevance. Although a search engine is really a general class of programs, the term is often used to specifically describe systems like Google that enable users to search for documents on the World Wide Web.
Search Personalization	The ability to personalize SERPs based on personal profile information, settings or location (IP address).
Search phrases and Search words	Phrases and words that visitors typed into search engines to come to your website. These help you to target the right keywords on your website.
Search Terms	The words (or phrase) a searcher enters into a search engine's search box. Also used to refer to the terms a search engine marketer hopes a particular page will be found for. Also called keywords, query terms or query.
Search Words	This report shows the search words and phrases that were used in search engines to reach your site. As this list grows, you will want to add missing words to your META KEYWORD tag on each page to increase search engine hits.
Secondary Server Calls	In multi-suite tagging, it is the second call to the Omniture servers. Any subsequent calls to the Omniture servers are charged at a price determined in the client's contract. For example, if s_account=abc,123 the call to 123 is the secondary server call.
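The s_account example above can be illustrated by splitting the comma-separated value into a primary report suite and its secondary suites. This is only a sketch of the idea in Python, not Omniture's own code; the function name `split_report_suites` is hypothetical.

```python
from typing import List, Tuple


def split_report_suites(s_account: str) -> Tuple[str, List[str]]:
    """Split a comma-separated s_account value into (primary, secondary suites)."""
    suites = [suite.strip() for suite in s_account.split(",") if suite.strip()]
    if not suites:
        raise ValueError("s_account must name at least one report suite")
    return suites[0], suites[1:]


# The glossary example: the call to "123" is the secondary server call.
primary, secondary = split_report_suites("abc,123")
print(primary, secondary)
```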
Secure Server	A computer that handles encrypted data for secure transactions so any communications are kept private, e.g. when sending credit card details to an e-retailer.
Segment Definition Builder	SiteCatalyst tool that allows you to define visitor segment filters based on visitor behavior. When submitting a Data Warehouse Request, you can apply a segment filter to the result set, which can be a significant advantage in analyzing your web site traffic.
Segment Wizard	SiteCatalyst wizard that enables you to create segment definitions. See Segment Definition Builder.
Segment	A grouping of customers, defined by website activity or other data, which can be used to target them effectively.
Segmentation	The process of dividing data and putting it into categories for easy analysis.
Segmented	A subset of the site traffic for a defined period of time, filtered in some way to gain greater analytical insight: e.g., by campaign (e-mail, banner, PPC, affiliate), by visitor type (new vs. returning, repeat buyers, high value), by referrer.
SEM	Search Engine Marketing is a means to increase the visibility of a website in search engine results pages.
SEO Management	See Search Engine Optimization
SEO	Search Engine Optimization is the improvement of rankings for relevant keywords in search results by adjusting website structure and content to make them more easily read and understood by a search engine's software programs.
Seriousness of usage	Suppose you need some important information from the web: would you prefer your desktop or your mobile? The answer will mostly lean towards the former. This reflects the behavior of most internet users; we are not questioning the use of mobiles for collecting serious information, merely pointing out what users prefer when looking for it. This behavior implies that the quality of browsing on desktop platforms is higher than on mobile platforms. The point here is the sustained focus of the user, which is greater on desktops than on mobiles.
Server Error	A fault occurring at the computer hosting information. See also return code.
server Variable	The server Variable is used to show either the domain of a web page (to show which domains people come to) or the server serving the page (for a load balancing quick reference). The server Variable is used to populate the Most Popular Servers Report in SiteCatalyst.
Server	A computer program that provides services to other computer programs in the same or other computers. When taken in the context of the World Wide Web, a server serves Web pages to the requesting computer.
Service Provider	See ISP.
Session Duration	Average amount of time that visitors spend on the site each time they visit. This metric can be complicated by the fact that analytics programs cannot measure the length of the final page view.
Session	A specific visit to a website that ends when the user has taken no further action after a given period of time (usually 30 minutes), indicating he or she is no longer at the site; sessions are also referred to as visits.
Sessionization	This is the process for creating a session. Sessionization methods are ways in which you can define a session. Web Analytics solutions have multiple sessionization methods such as cookies, IP Address, IP + Agent and so on. These methods tell the web analytics system how they should count a series of page requests from the same individual or browsing machine.
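The Session and Sessionization entries above can be sketched in Python as follows. The sketch assumes each hit is a `(visitor_id, timestamp)` pair, where `visitor_id` stands in for whatever the chosen method provides (a cookie value, an IP address, or IP plus user agent), and it uses the 30-minute inactivity timeout mentioned above. The function name `sessionize` is illustrative only.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(hits):
    """Group (visitor_id, timestamp) hits into per-visitor sessions."""
    sessions = []        # list of (visitor_id, [timestamps]) in start order
    open_session = {}    # visitor_id -> timestamps of that visitor's latest session
    for visitor_id, ts in sorted(hits, key=lambda hit: hit[1]):
        current = open_session.get(visitor_id)
        if current is not None and ts - current[-1] <= SESSION_TIMEOUT:
            current.append(ts)           # still within the inactivity window
        else:
            current = [ts]               # first hit, or timeout exceeded
            open_session[visitor_id] = current
            sessions.append((visitor_id, current))
    return sessions


hits = [
    ("a", datetime(2014, 1, 1, 9, 0)),
    ("b", datetime(2014, 1, 1, 9, 5)),
    ("a", datetime(2014, 1, 1, 9, 10)),
    ("a", datetime(2014, 1, 1, 10, 0)),  # 50 minutes after a's previous hit
]
sessions = sessionize(hits)
print(len(sessions))
```

Here visitor `a` produces two sessions, because the 50-minute gap exceeds the timeout, while visitor `b` produces one.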
Set analytical goals	Drawing clear analytical goals and then using tools, which are temporarily limited in capabilities, for capturing more accurate user information should be your mantra. This alone will get you some decent and actionable visitor information. So the key here is to set your analysis goals straight and smart and in alignment with the tools' capabilities.
SET	Secure Electronic Transaction. A standard that enables secure credit card transactions on the internet. SET has been endorsed by virtually all the major players in the electronic commerce arena, including Microsoft, Netscape, Visa, and MasterCard. SET provides secure communication of credit card numbers to card issuers.
setClickMapEmail	Provides the power of Omniture ClickMap in HTML emails to provide insight into what links are most commonly clicked.
Settlement	The process by which money is transferred between a merchant and a cardholder.
Shareware	A shareware program is one you can try free of charge, though often for a limited period, or with certain features disabled. A registration fee is usually payable to continue using it. See also Freeware.
Shell Archive	A shell archive is a collection of files that can be unpacked by using the Unix Bourne shell command interpreter /bin/sh.
Shipper (Carrier / Transporter)	Organisation that handles all aspects of the delivery of physical goods.
Shop Window Site	A site that advertises a company's business and provides information about its products and services, but does not have payment facilities to sell products directly.
Shopping Basket	See Shopping Cart.
Shopping Search/Feeds	Shopping search engines allow shoppers to look for products and prices in a search environment for rapid and easy comparison. Premium placement can be purchased on some shopping search indices via XML feeds.
Single Access	SiteCatalyst report that shows you the pages of your web site that visitors enter and exit without taking steps to view any other pages on your web site.
Single-page Visit	The pages of a web site that visitors enter and exit without taking steps to view any other pages on the web site.
Singletons	The number of visits where only a single page is viewed. While not a useful metric in and of itself, the number of singletons is indicative of various forms of click fraud, as well as being used to calculate bounce rate and in some cases to identify automatons (bots).
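As a sketch of how singletons feed into bounce rate, the Python snippet below assumes the common definition of bounce rate as single-page visits divided by total visits. That formula is an assumption, since the entry above does not spell it out, and the function name `bounce_rate` is illustrative.

```python
def bounce_rate(pages_per_visit):
    """Share of visits that viewed exactly one page (assumed definition)."""
    visits = len(pages_per_visit)
    if visits == 0:
        return 0.0
    singletons = sum(1 for pages in pages_per_visit if pages == 1)
    return singletons / visits


# Four visits with 1, 3, 1 and 5 page views: two of them are singletons.
print(f"{bounce_rate([1, 3, 1, 5]):.0%}")  # prints "50%"
```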
Site Domains	Site Domains are all the valid domains (URLs) that point to a given website. For example, the Site Domains for google.com are: www.google.com, and google.com.
Site Optimization	The act of fine-tuning web site content and code to perform well in search engine results. See Search Engine Optimization.
Site Overlay	A technique in which graphical statistics are shown beside each link on the web page. These statistics represent the percentage of clicks on each link.
Site Search	The business edition of Custom Search. Google Site Search lets you create search engines that do not include ads, remove Google branding (if you so choose), and have more control over how the results are presented to your users, among other things. To learn more, see the Google Site Search page.
Site Sections Depth	SiteCatalyst report that identifies the depth at which each page within your site is visited. Depth for a page is measured by counting the number of pages viewed before that page. So, if your About Us page is the third page visited by a given visitor, its depth for that visit is three.
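The depth example above (About Us viewed third, so depth three) can be reproduced with a short Python sketch. It assumes depth is the 1-based position of a page's first view within a visit; the function name and page paths are made up for illustration.

```python
def page_depths(visit_pages):
    """Map each page to the 1-based position of its first view in a visit."""
    depths = {}
    for position, page in enumerate(visit_pages, start=1):
        depths.setdefault(page, position)  # keep only the first occurrence
    return depths


visit = ["/home", "/products", "/about-us", "/home"]
print(page_depths(visit)["/about-us"])  # prints 3
```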
Site Sections Summary	SiteCatalyst report that collects and organizes page-specific information about a single page and presents it in a single report.
Site statistics is suitable for	Sites with low traffic, sites with minimum content, outreach sites, etc.
Site Statistics	Get basic metrics and information on how people are using your website. You can also track visitors around your site. Typical reports include top referrers and keywords, most popular content, and geographic information about the audience of your site.
Site Traffic	Metrics that report the number of visitors to your web site based on daily, weekly, monthly, and yearly time frames.
site	Website, webpage, or URL pattern defined in your annotations file.
Sitemap	An XML file that lists pages on your website. It could also include information about the webpages, such as when they were most recently updated, how frequently they change, and how important they are in relation to each other. Submitting a Sitemap of your website to Google Webmaster Central helps Google discover pages on your site, including those that Google could not find with the normal crawling process. To learn more about the XML schema, see the Using the Sitemap Protocol page for Webmaster Tools.
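For a sense of what a Sitemap contains, the following Python sketch parses a tiny, made-up Sitemap with the standard `xml.etree.ElementTree` module. The element names (`loc`, `lastmod`, `changefreq`, `priority`) and the namespace come from the Sitemap protocol; the URL, the values, and the `read_sitemap` helper are illustrative only.

```python
import xml.etree.ElementTree as ET

# A made-up one-page Sitemap using the Sitemap protocol's namespace.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2010-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
FIELDS = ("loc", "lastmod", "changefreq", "priority")


def read_sitemap(xml_text):
    """Return one dict per <url> entry; fields missing from the file are None."""
    root = ET.fromstring(xml_text)
    return [
        {field: url.findtext(f"sm:{field}", namespaces=NS) for field in FIELDS}
        for url in root.findall("sm:url", NS)
    ]


entries = read_sitemap(SITEMAP_XML)
print(entries[0]["loc"], entries[0]["changefreq"])
```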
SKU	Stock Keeping Units
SmartSource	A trademarked technology from Webtrends. SmartSource Data collection offers an alternative to traditional web server log file analysis, collecting information directly from the visitor's browser (the client) rather than from server log files, improving data accuracy.
SmartView	Webtrends SmartView is an easy-to-use visual overlay of web metrics displayed right on a web page, which you can use to analyze page performance, providing insight into page conversion, path analysis, and overall web page statistics such as unique visitor counts.
Smoothing Technique	SiteCatalyst tool that displays a graphical representation of how a metric performs over time.
snippet	A search result in the results page. It shows a small sample of content of the webpage. To learn more, see the Webmaster Central blog post on the anatomy of a search result.
Social Media Marketing (SMM)	Social media is the in-thing now and all marketers are falling head over heels for it. Twitter, Friendster, Digg, Facebook and a whole lot of popular social networking and social bookmarking sites have facilitated online marketing and branding in a revolutionary style. Outsource2india's social media optimization services (SMO services) will help you create and manage your online brand efficiently.
Social Media Monitoring & Analysis	The process of monitoring and analyzing data generated by social media and related marketing and optimization efforts.
Social Media Optimization (SMO)	A set of methods for generating publicity through social media, online communities and community websites. Methods of SMO include adding RSS feeds, adding a Digg This button, blogging and incorporating third party community functionalities like Flickr photo slides and galleries or YouTube videos. Social media optimization is a form of search engine marketing.
Social Media	An umbrella term that defines the various activities that integrate technology, social interaction, and the construction of words and pictures. This interaction, and the manner in which information is presented, depends on the varied perspectives and building of shared meaning, as people share their stories, and understandings.
Software	The programs, routines, and symbolic languages that control the functioning of the hardware and direct its operation. Written programs or procedures or rules and associated documentation pertaining to the operation of a computer system and that are stored in read/write memory.
Source (Campaign Tracking)	In the context of campaign tracking, a source is the origin of a referral. Examples of sources are the Google search engine, the AOL search engine, the name of a newsletter, or the name of a referring web site. The UTM variable for source is utm_source.
Source	Also known as source code. The actual text and commands stored in an HTML file (including tags, comments, and scripts) that may not be visible when the page is viewed with a web browser.
Spam	Any search engine marketing method that a search engine deems to be detrimental to its efforts to deliver relevant, quality search results. Some search engines have written guidelines about what they consider to be spamming, but ultimately any activity a particular search engine deems harmful may be considered spam, whether or not there are published guidelines against it. Examples of spam include the creation of nonsensical doorway pages designed to please search engine algorithms rather than human visitors, or a heavy repetition of search terms on a page to increase keyword density. Also referred to as spamdexing.
Spider	An automated software program that gathers pages from the Internet.
SSL Certificate	An attachment to an electronic message used for security purposes. The most common uses of a digital certificate are to verify that a user sending a message is who he or she claims to be, and to provide the receiver with the means to encode a reply. An individual wishing to send an encrypted message applies for a digital certificate from a Certificate Authority (CA). The CA issues an encrypted digital certificate containing the applicant's public key and a variety of other identification information. The CA makes its own public key readily available through print publicity or perhaps on the Internet. The recipient of an encrypted message uses the CA's public key to decode the digital certificate attached to the message, verifies it as issued by the CA, and then obtains the sender's public key and identification information held within the certificate. With this information, the recipient can send an encrypted reply.
State Variable	The state variable tracks the U.S. state in which a visitor is located.
Status Code	A status code, also known as an error code, is a 3-digit code number assigned to every request (hit) received by the server. Most valid hits will have a status code of 200 (OK). Page not found errors will generate a 404 error. Some commonly seen codes are shown below: 100 Continue; 101 Switching Protocols; 200 OK; 201 Created; 202 Accepted; 203 Non-Authoritative Information; 204 No Content; 205 Reset Content; 206 Partial Content; 300 Multiple Choices; 301 Moved Permanently; 302 Moved Temporarily; 303 See Other; 304 Not Modified; 305 Use Proxy; 400 Bad Request; 401 Authorization Required; 402 Payment Required; 403 Forbidden; 404 Not Found; 405 Method Not Allowed; 406 Not Acceptable; 407 Proxy Authentication Required; 408 Request Time-Out; 409 Conflict; 410 Gone; 411 Length Required; 412 Precondition Failed; 413 Request Entity Too Large; 414 Request-URI Too Large; 415 Unsupported Media Type; 500 Server Error; 501 Not Implemented; 502 Bad Gateway; 503 Out of Resources; 504 Gateway Time-Out; 505 HTTP Version not supported.
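The status codes listed above fall into five classes according to their first digit. The Python sketch below groups codes by class and tallies a made-up list of hits; the function name `status_class` and the sample data are illustrative.

```python
from collections import Counter

# The first digit of an HTTP status code identifies its class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Return the class name for a 3-digit HTTP status code."""
    if not 100 <= code <= 599:
        raise ValueError(f"not an HTTP status code: {code}")
    return STATUS_CLASSES[code // 100]


# Tally a made-up list of status codes taken from a hit log.
hits = [200, 200, 304, 404, 200, 500, 301]
summary = Counter(status_class(code) for code in hits)
print(summary)
```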
Stickiness	A website's capability to retain visitors, measured as number of pages visited per session and time spent on website.
stored custom search engine	Search engine whose specifications are hosted by Google. Any search engine that is created through the wizard or uploaded in the control panel is called a stored custom search engine. In contrast, linked custom search engines are hosted in other websites.
Strong Password	Security option in SiteCatalyst that prevents users from selecting passwords that are easily guessed. Enabling this option will prevent users from choosing simple passwords.
structured data format	Specifications for tags that make data more meaningful to machines. Custom Search can read microformats, RDFa, and PageMaps. To learn more, see Customizing Result Snippets.
structured data	Semantic metadata that describes the content of the webpage. It consists of text snippets embedded in the page to give information that might be more meaningful to machines. To learn more, see Customizing Result Snippets.
Submission	The act of submitting a URL for inclusion into a search engine's index. Unless done through paid inclusion, submission generally does not guarantee listing. In addition, submission does not help with rank improvement on crawler-based search engines unless search engine optimization efforts have been implemented. Submission can be done manually (i.e., you fill out an online form and submit) or automated, where a software program or online service may process the forms behind the scenes.
Suffix	The last part of a domain that can be used to identify the type of organization or location of a site.
Summary Report	Presents an executive overview of assorted conversion information. Each reporting section uses at least one Conversion Summary report, which presents a synopsis of the information contained in the other reports in that group. SiteCatalyst Conversion employs three distinct versions of the Conversion Summary report: Conversions & Averages, Most Improved and Item Summaries.
synonyms	Variants of a query term. For example, cd could have the following related terms: certificate of deposit, fixed-income instrument, and fixed cashflow. To learn more, see Improving User Queries For More Relevant Results.
T1	A dedicated Internet connection that supports data transfer rates of up to 1.544 megabits per second.
Tabular View	SiteCatalyst report view that can help you view current performance for your site, performance trends or the portions that have had the greatest or least improved performance for any time period.
Tag	See page tag.
Taguchi Method	Mathematical theory used in Multivariate Testing.
Target	SiteCatalyst feature that enables you to see, graphically, how your site is performing based on set goals (or targets).
Task	A Task is a log-processing event of any type programmed into the Scheduler. Tasks can be set to execute at virtually any frequency desired, but are generally set to run at a daily interval.
TCP/IP (Transmission Control Protocol/Internet Protocol)	Represents the suite of communications protocols used to connect hosts to the Internet. TCP/IP is the de facto standard for transmitting data over networks. Even network operating systems with their own protocols, such as Netware, support TCP/IP.
Technical glitches	JavaScript page tagging, which notifies third-party servers when web pages are rendered, is not properly supported on most low-end mobile phones, making it difficult to identify site visitors. Sometimes the mobile user may unknowingly disable JavaScript, thereby disabling identification. Similarly, HTTP cookies, which are generally used for gathering visitor metrics, are not supported by all mobile handsets, making it difficult to decipher visitor information. However, the latest smartphones are built with more capable features that enable accurate analytics.
Telnet	A terminal emulation program for TCP/IP networks that runs on your computer and connects to servers on the network. The Internet is a large TCP/IP network. A telnet session is initiated when the user types in the host and the host's port and enters a valid username and password. Once connected, the telnet user can run commands as if he/she were physically at that server.
Term	(Campaign Tracking) In the context of campaign tracking, term refers to the keyword(s) that a visitor types into a search engine. The UTM variable for term is utm_term. Term is one of the five campaign dimensions; the other four are source, medium, content, and campaign.
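As a rough illustration of how the five campaign dimensions can be read from a tagged link, the sketch below parses the UTM parameters out of a landing-page URL with Python's standard library. The example URL and the helper name campaign_dimensions are hypothetical, not part of any analytics product.

```python
from urllib.parse import urlparse, parse_qs

# The five campaign dimensions named in the glossary entry.
UTM_KEYS = ("utm_source", "utm_medium", "utm_term", "utm_content", "utm_campaign")

def campaign_dimensions(url):
    """Return the UTM parameters present in a landing-page URL."""
    query = parse_qs(urlparse(url).query)
    # parse_qs maps each key to a list of values; keep the first one.
    return {key: query[key][0] for key in UTM_KEYS if key in query}

# Hypothetical landing-page URL used only for illustration.
example = ("http://www.example.com/landing?utm_source=google&utm_medium=cpc"
           "&utm_term=running+shoes&utm_campaign=spring_sale")
dims = campaign_dimensions(example)
print(dims["utm_term"])  # running shoes
```

Dimensions that are absent from the link (utm_content here) are simply left out of the result.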
Text Ads	Online advertisements that do not contain graphic images. Typically, they are just textual links to other web pages or web sites.
The customer requirement	Our web analytics customer wanted segmentation that was both specific and complex.
The customer requirement	Our customer, a digital agency, required a method by which they could easily publish data online, saving themselves the trouble of sending reports to their customers as Excel or PDF files. With an online dashboard, their customers could log in and see the data whenever required.
Third-party cookie	Hosted web analytics services track visitor behavior by inserting a small piece of tracking code onto each page of a site. Because the cookie is served by an analytics vendor rather than your own site, the cookie is considered third-party.
Third-party Cookies	Third party cookies are left on your machine by a domain other than one that you are currently viewing. See Cookies.
Threshold	The value of a metric above which an element is of interest. Typically, below a threshold, metrics do not correlate with real effects, and code elements below threshold usually do not require code to be reviewed or modified.
Time Offset	The amount of time by which a server logging in Greenwich Mean Time (GMT) needs to be adjusted to arrive at the correct local time for the site.
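A minimal sketch of applying a time offset, assuming the log timestamps are in GMT and the offset is expressed in whole hours. The function name to_local_time and the -5 hour offset are illustrative assumptions.

```python
from datetime import datetime, timedelta

def to_local_time(gmt_timestamp, offset_hours):
    """Shift a GMT log timestamp by the site's time offset (in hours)."""
    return gmt_timestamp + timedelta(hours=offset_hours)

# Hypothetical log entry recorded at 14:30 GMT for a site five hours behind GMT.
logged = datetime(2013, 7, 25, 14, 30)
local = to_local_time(logged, -5)
print(local)  # 2013-07-25 09:30:00
```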
Time Slice	In the ASI Segmentation Wizard, the Time Slice represents a clearly defined date range. The resulting segment will only contain data between the From and To dates.
Time spending pattern	Users are generally inclined to spend more time browsing on their desktops or laptops than on their smartphones, even fully featured ones. This has more to do with ergonomics. Because mobile platforms are confined by various technical and physical limitations, users tend to browse on them briefly and at opportune moments; for time-consuming activities, they would rather use their desktops, which strain them less.
Time Spent on Page	SiteCatalyst report that displays the amount of time visitors spent on a certain page of the web site.
Time Spent on Site Sections	SiteCatalyst report that displays the amount of time visitors spend in a certain section of the web site.
Time Spent per Visit	SiteCatalyst report that reveals the length of time visitors spend viewing your site as a whole during each visit.
Time Zones	This report shows you the time zones that your visitors are in. This report will help you understand how your visitor base is distributed and if you have a particular customer base in a specific time zone.
Tools
Top exit pages	The pages from which the most people leave your site.
Top landing pages	This shows you which of your pages attract the most inbound traffic.
Top Level Domains	SiteCatalyst report that shows you the countries where your visitors have come from based on their originating domain.
Top-Level Domain	A Top-Level Domain (TLD) is the last part of a URL or domain name. For instance, the TLD of google.com is .com, and the TLD of google.co.uk is .uk.
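The definition above (the TLD is the last label of the domain name) can be sketched in a few lines. The helper name top_level_domain is an assumption, and the function deliberately returns only the final label, so google.co.uk yields .uk as in the entry.

```python
def top_level_domain(hostname):
    """Return the last dot-separated label of a domain name, with a leading dot."""
    # Ignore a trailing dot (fully qualified form) and normalize case.
    labels = hostname.rstrip(".").lower().split(".")
    return "." + labels[-1]

print(top_level_domain("google.com"))    # .com
print(top_level_domain("google.co.uk"))  # .uk
```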
Total DATA SECURITY promise	We have strict code and processes to make sure that your data is 100% safe. We sign a Non Disclosure Agreement with our clients and we have secured access systems for authorized personnel only. All our employees sign the Confidentiality Agreement. We have all other security features in place like: dedicated firewall, virus prevention, spam filtering, centralized server, and file system access policy based on user authentication.
Total Unique Visitor Sessions	The total number of Sessions from identified Unique Visitors during the time period (Date Range) being analyzed.
trackDownloadLinks Variable	SiteCatalyst variable that enables you to track links to downloadable files on your web site.
trackExternalLinks Variable	SiteCatalyst variable that enables linkInternalFilters and linkExternalFilters to determine whether any link clicked is an exit link.
Tracking Code	The Google Analytics tracking code is a small snippet of code that is inserted into the body of an HTML page. When the HTML page is loaded, the tracking code contacts the Google Analytics server and logs a pageview for that page, as well as captures information about the visit and non-identifying information about the visitor.
trackInlineStats Variable	SiteCatalyst variable that determines whether ClickMap data is gathered or not.
trackYahooStores	Plug-in to automatically track purchases through an integrated Yahoo Store. This plug-in will capture the products, revenue, units and orders on the final confirmation page.
Traffic Navigator	SiteCatalyst page that displays the Traffic reports that are available to your given implementation.
Traffic sources	This tells you how visitors get to your site, providing numbers for each of three methods;
Traffic Variable	Custom Insight Traffic Variable (or prop) enables you to correlate custom data with specific traffic-related events. The prop variables are embedded in the SiteCatalyst code on each page of your website.
Traffic	On the web, traffic refers to the amount of data sent and received by visitors to a website.
Transaction	Any process set by the Web site owner that begins with an order variable and ends with a success variable. This could mean a product purchase, newsletter signup, or e-mail request for information after going through a preset process.
Unique Customer	A unique customer is registered when a person makes a purchase from your site for the first time within a specified period of time. In other words, while one person may buy from your site three times, this person would be recorded as one unique customer, so you can tell exactly how many individual people are purchasing from your site. There are five different time frames SiteCatalyst uses to define unique customers: daily, weekly, monthly, quarterly, and yearly. A daily unique customer may purchase from your site twice on February 7th, and then again on February 8th. This customer will register as two daily unique customers, because only the first purchase on the 7th and the purchase on the 8th count as unique purchases for their respective days. This same standard is used to determine monthly and yearly unique customers for their respective time frames. It can be helpful to ask the following question to see how SiteCatalyst determines who is a unique customer: How many different people purchased from my site during this time period? NOTE: The sum of all daily unique customers is not equal to the total monthly unique customers for that month. This is because a customer who purchases twice in a given month will count as two daily unique customers, one for each day they purchased, but only as a single monthly unique customer. The same relationship holds true for monthly unique customers and yearly unique customers.
URL	The method used to give an address to documents and other resources on the World Wide Web. The first part of a URL indicates what protocol to use, and the second part specifies the IP address or the domain name where the resource is located. For example, the first section of the URL http://www.Omniture.com specifies that the Web page should be fetched using the HTTP protocol. The second section directs the main page in the Omniture domain to be retrieved.
transliteration	Transcription of words in English alphabet into another writing system. Transliteration converts words you enter in English into their phonetic equivalents in another script, such as Arabic or Hindi.
Trending	Report view that gives you an opportunity to view report trends over a given period of time in order to identify data patterns.
True INSIGHTS for real challenges	Growing companies do not need more data; they need answers and solutions to their problems. We have moved far above the traditional reporting format. We offer deep implementable insights for optimization of all online marketing initiatives of our clients. Our clients have been able to define and achieve their digital marketing objectives and drive competitive advantage.
True Real-time reporting	Sawmill can be configured to provide true real-time reporting: up-to-the-second reporting on the current contents of your log files. There is no need for explicit database refreshes, and no need to wait for the log data to finish loading into the database before viewing reports from the latest data.
TSV file	Tab Separated Value file, a plain-text file that includes lines of fields (strings of characters) that are separated from each other by single tab stops. You can use a simple text editor or a spreadsheet editor to create and edit a TSV file. Just save the text file with the file extension .tsv (for example, cse_bicycles.tsv).
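A TSV file can also be produced and read programmatically. The sketch below uses Python's csv module with a tab delimiter; the column names and rows are made-up sample data, and an in-memory buffer stands in for a file such as cse_bicycles.tsv.

```python
import csv
import io

# Hypothetical rows; the column names are illustrative only.
rows = [
    ["url", "label"],
    ["http://www.example.com/road", "road bikes"],
    ["http://www.example.com/mountain", "mountain bikes"],
]

# Write the rows as tab-separated text (an in-memory stand-in for a .tsv file).
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
tsv_text = buffer.getvalue()

# Read the text back: each line is split into fields on single tab stops.
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(parsed[1])  # ['http://www.example.com/road', 'road bikes']
```

To work with a real file, the same writer and reader calls can be given a file object opened with newline="" instead of the StringIO buffer.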
UI	User interface. See also Graphical User Interface.
Understanding the Different Types of Visitor	People come to your site for different reasons and at different stages of the purchase process. You need to understand this and ensure your site caters for each if you want to maximise online sales. Different Types of Website Visitor
Undivided FOCUS on web analytics	Our core area is web analytics and that is what we specialize in. Over the years, we have gained vast knowledge and insights in various analytics technologies and techniques. With hundreds of websites under our constant purview, we evolve continuously with the new challenges that crop up daily in the online analytics world. Our certified experts can work on all major web analytics tools and make the most of their latent features.
Unique Customer	Identifies the number of different people that make purchases from your site during different time frames. There are five different Unique Customers Reports: daily, weekly, monthly, quarterly, and yearly.
Unique IP Addresses	The number of unique IP addresses that visited the site in question during the given time frame. This is not to be confused with unique users. However, the count of unique IP addresses can serve as a lower bound for the number of unique users. IP addresses normally provide a fairly low count for unique users because most mega proxies (AOL, MSN, etc.) mask all of their users behind a single IP address, so those users appear as one user.
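To see why unique-customer counts depend on the time frame, the following sketch counts daily and monthly unique customers from a small purchase log. The customer names and dates are invented for illustration, and the set-based counting is an assumption about how such a report could be computed, not SiteCatalyst's actual implementation.

```python
from datetime import date

# Hypothetical purchase log: (customer id, purchase date).
purchases = [
    ("alice", date(2013, 2, 7)),
    ("alice", date(2013, 2, 7)),   # second purchase on the same day
    ("alice", date(2013, 2, 8)),
    ("bob", date(2013, 2, 8)),
]

# A customer counts once per day for the daily report...
daily_unique = {(customer, day) for customer, day in purchases}
# ...and once per calendar month for the monthly report.
monthly_unique = {(customer, day.year, day.month) for customer, day in purchases}

print(len(daily_unique))    # 3
print(len(monthly_unique))  # 2
```

The daily figures sum to 3 while the monthly figure is 2, which is why daily unique customers cannot simply be added up to obtain the monthly total.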
Unique Users	An unduplicated count of all individually identified machines that made a visit to a selected domain during a given analysis period.
Unique Visitor Session	A Unique Visitor Session is a quantity of visitor interaction with a website for which the visitor can be tracked and declared with a high degree of confidence as being unique for the time period being analyzed.
Unique Visitor	Unique visitors represent the number of unduplicated (counted only once) visitors to your website over the course of a specified time period. A unique visitor is determined with cookies.
Unique Visitors	Unique Visitors represents the number of unduplicated (counted only once) visitors to your website over the course of a specified time period. A Unique Visitor is determined using cookies.
UNIX	A computer operating system originally developed in the 1960s and 1970s by a group of AT&T Bell Labs employees including Ken Thompson, Dennis Ritchie, and Douglas McIlroy.
Untraceable Session	A period of visitor interaction with a website for which the visitor cannot necessarily be distinguished as unique or not.
Untrackable Session	A period of visitor interaction with a website for which the visitor cannot necessarily be distinguished as being unique.
URL pattern	A group of URLs that match a given pattern. For example, *.google.com/ is a pattern for all subdomains (such as code.google.com and images.google.com) under google.com. To learn more, see the Custom Search Help Center.
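The idea of a URL pattern can be approximated with glob-style matching. The sketch below uses Python's fnmatch module as a stand-in; it is not Custom Search's own matcher, and the helper name matches_url_pattern is hypothetical.

```python
from fnmatch import fnmatchcase

def matches_url_pattern(url, pattern):
    """Check a URL against a glob-style pattern, ignoring the scheme."""
    # Drop "http://" so the URL is compared as "code.google.com/".
    stripped = url.split("://", 1)[-1]
    return fnmatchcase(stripped, pattern)

pattern = "*.google.com/"
print(matches_url_pattern("http://code.google.com/", pattern))    # True
print(matches_url_pattern("http://images.google.com/", pattern))  # True
print(matches_url_pattern("http://www.example.com/", pattern))    # False
```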
URL	Uniform Resource Locator is a means of identifying an exact location on the Internet. For example, http://www.googleanalytics.com/support/platforms.html is the URL that defines the use of HTTP to access the web page platforms.html in the /support/ directory on the Google Analytics website. URLs typically have four parts: protocol type (HTTP), host domain name (www.googleanalytics.com), directory path (/support/), and file name (platforms.html).
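The four parts described above can be pulled out of the example URL with Python's standard library. This is only a sketch; the variable names are illustrative.

```python
from urllib.parse import urlparse
import posixpath

url = "http://www.googleanalytics.com/support/platforms.html"
parts = urlparse(url)

# Split the path into its directory and file name components.
directory, filename = posixpath.split(parts.path)

print(parts.scheme)     # http
print(parts.netloc)     # www.googleanalytics.com
print(directory + "/")  # /support/
print(filename)         # platforms.html
```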
Usability Testing	Usability-testing is the measurement of how well a website aligns with the behaviours of online users, enabling them to complete their tasks efficiently, effectively, and satisfactorily.
Usenet	A worldwide bulletin board system with over 14,000 newsgroups.
usePlugins Variable	Determines if the doPlugins function is run or not.
User Agent	A user agent is a generic term for any program used for accessing a website. This includes browsers (such as Internet Explorer or Netscape), robots and spiders, and any other software program that acts as an agent for someone or something seeking information from a website.
User Session	A period of activity (all hits) for one user of a website. A unique user is determined by the IP address or cookie. Typically, a user session is terminated when a user is inactive for more than 30 minutes. See also visit.
User	A person who accesses a website; a user might be responsible for multiple visits to the site over a period of time, or make multiple visits during one session.
Username	A Username is a name used to gain access to a computer system. Usernames, and usually passwords, are required in multi-user systems. In most such systems, users can choose their own usernames and passwords.
UV	Unique Visitors refers to a measure captured by some web analytics solutions that track the interaction a single user has with a website over time.
Variable Truncation	Often refers to the shortening of a variable or an image request. If the number of characters in a variable or image request exceeds the allowed number of characters, the characters exceeding the limit will be automatically removed, or truncated.
Variable	Variables are contained in both the SiteCatalyst Code to Paste and the JavaScript file. Their primary purpose is to send values to SiteCatalyst and to control JavaScript execution.
Vendor	An organization that sells technology and/or services to another organization.
Very Fast	Since Sawmill generates a new report every time you click the mouse, it has been heavily optimized for speed. Most pages load in less than five seconds, so you won't be waiting for your statistics. There is no limit to the amount of data Sawmill can analyze, and even really huge datasets (gigabytes of log data) can be browsed in real time. Sawmill uses multiple levels of caching to ensure that once something has been computed, it will re-use the computation when possible, for highest performance.
View Total	The View Total is the tally of items currently shown in the report. This total does not include items that are not shown. For example, if the report in question is showing 10 items out of 45, the View Total number represents the total for only the 10 items shown. Below the View Total listing is the Total, which represents the tally of all items in this report for this Date Range.
Viral Marketing	Any marketing technique that induces Web sites or users to pass on a marketing message to other sites or users, creating a potentially exponential growth in the message's visibility and effect.
Visibility time	The time a single page (or a blog, banner) is viewed.
Visit Depth	The depth to which customers to your site browse. For example, if a customer views three pages on your site before making a purchase, that visit depth would be three.
Visit Length	See Time Spent per Visit
Visit	A page request or a series of page requests by a visitor to a given domain. If 30 minutes elapse after a page request without a subsequent page request, the visit session is closed. A new visit session is opened upon the next page request to the given domain.
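The 30-minute rule can be sketched as a simple pass over one visitor's page-request timestamps. The function name count_visits and the sample timestamps are assumptions, and treating a gap of exactly 30 minutes as still belonging to the same visit is a choice made here, not something the definition specifies.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def count_visits(request_times):
    """Count visits for one visitor from page-request timestamps."""
    visits = 0
    previous = None
    for current in sorted(request_times):
        # The first request, or one arriving after the timeout, opens a new visit.
        if previous is None or current - previous > SESSION_TIMEOUT:
            visits += 1
        previous = current
    return visits

# Hypothetical requests: 10:00 and 10:10 share a visit; 10:45 starts a new one.
requests = [
    datetime(2013, 7, 25, 10, 0),
    datetime(2013, 7, 25, 10, 10),
    datetime(2013, 7, 25, 10, 45),
]
print(count_visits(requests))  # 2
```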
Visit-based Cookies	The visit-based (or session) cookie is a type of persistent cookie, but it offers an additional level of security over the basic persistent cookie because the visit-based cookie is deleted when your session ends.
Visitor / Unique Visitor / Unique User	The uniquely identified client generating requests on the web server (log analysis) or viewing pages (page tagging) within a defined time period (e.g., day, week, or month). A Unique Visitor counts once within the timescale. A visitor can make multiple visits. Identification is made to the visitor's computer, not the person, usually via cookie and/or IP+User Agent. Thus the same person visiting from two different computers will count as two Unique Visitors. Increasingly, visitors are uniquely identified by Flash LSOs (Local Shared Objects), which are less susceptible to privacy enforcement.
% Exit	The percentage of users who exit from a page.
Visitor Detail	SiteCatalyst report that shows visitor information for the last visitors to your site. Each visitor is defined by IP address. Information collected for each visitor is presented in an easy to read table with detail for five visitors listed on each page.
Visitor Number	SiteCatalyst report that helps you gauge visitor loyalty by tracking the number of times each visitor visits your site. During your selected time period, you can see whether more of the visits were from visitors that came to your site for the first time or the 20th time.
Visitor Retention	See Return Visits
Visitor Segmentation	The process of segregating and studying visitors based on various behavior patterns.
Visitor Session	A Visitor Session is a defined period of interaction between a Visitor (both unique and untrackable visitor types) and a website. The definition of a Session varies depending on the type of visitor tracking employed.
Visitor Sessions	Visitor Sessions represents the number of times individual users visited your website over the course of a specified time period. This is a sum of First-time, Returning, and Unknown Sessions.
Visitor	Similar to unique visitor, visitor refers to an individual that visits a website. A visitor or unique visitor can have multiple visits.
visitorNamespace Variable	This variable is used to identify the domain with which cookies are set.
Visitors Total	Visitors is the number of Total Unique Visitors plus the number of Untrackable IP-based Visitors, which represents all individual visitors to your website over the course of a specified time period.
visitorSampling Variable	visitorSampling is the percentage of visitors to your site that are tracked via SiteCatalyst. If you would like to track 10% of the visitors to your site, just set visitorSampling to 10.
visitorSamplingGroup Variable	visitorSamplingGroup is an optional variable used to determine the sampling group being tested by SiteCatalyst.
Visits	Measures the number of people who come to your site. Generally measured within a 30 minute time span from the first visit, so if a visitor visits your site, goes away and comes back 35 minutes later and clicks around, his activity would be considered 2 visits.
Visits/Sessions	A series of hits and page views from one particular IP address. A session has a timeout period, i.e., if there are no hits for a particular time period (usually 30 minutes), the next hit is considered the start of the next session.
VISTA	VISTA is an acronym for Visitor Identification, Segmentation, and Transformation Architecture. This proprietary Omniture technology uses VISTA rules to create real-time segmentation of all online data.
VOD	Video on Demand
W3C	The W3C, or World Wide Web Consortium, is a standards body dedicated to ensuring interoperability between all the varied system and network types that comprise the World Wide Web part of the Internet. The W3C log format is commonly used by several web server software systems, such as Microsoft IIS. For more information, see the W3C website.
Warehouse	See data warehouse.
Web 2.0	The use of World Wide Web technology and web design that aims to facilitate creativity, information sharing, and, most notably, collaboration among users. These concepts have led to the development and evolution of web-based communities and hosted services, such as social-networking sites, wikis, blogs, and folksonomies.
Web Analysis Data	The data collected from your web site or web application in order to study user activities and accomplish three tasks: 1) to understand how well the site fulfills its objectives and meets business and user requirements, 2) to seek ways to optimize it to become more usable, relevant, and efficient, and 3) to maximize the return on investment.
Web Analytics Association	Site for the only industry association related to web analytics. Lists jobs, articles and more.
Web Analytics Consulting	conversion rate improvement, persuasive architecture, development of internal analytics processes. I have been providing bespoke web analytics solutions for websites large and small since 2000. During this time I have developed a unique expertise in their use. I can use my understanding of web analytics to improve your online performance. Or I can help your organisation develop its own web analytics capabilities.
Web Analytics Demystified	Home of the Web Analytics Demystified book, this site has evolved into a source for other Web Analytics information. Eric Peterson is a leading author, speaker and industry representative for the Web Analytics community.
Web Analytics Development	If you're running your own dynamic web system, using tools like Microsoft Content Management Server, Macromedia Cold Fusion, or even CICS/Web, chances are your best web analytics solution will be to create your own web analytics system within the framework of your existing architecture. However (as I have discovered the hard way), database and processing structures for web analytics are not self-evident. I can provide your staff with the expertise they will need to create a successful web analytics system, one which meets your managerial requirements and which offers suitable performance. Avoid lengthy project delays and get the benefit of guidance from someone who's been there before.
Web Analytics Knowledge	Complete resource site for all things related to web analytics. Includes web analytics dictionary, articles, discussions, job listings, web analytic product reviews and more.
Web analytics	The measurement, collection, analysis and reporting of internet data for purposes of understanding and optimizing web usage.[1] Web analytics is not just a tool for measuring website traffic but can be used as a tool for business research and market research. Web analytics applications can also help companies measure the results of traditional print advertising campaigns. It helps one to estimate how traffic to a website changes after the launch of a new advertising campaign. Web analytics provides information about the number of visitors to a website and the number of page views. It helps gauge traffic and popularity trends, which is useful for market research. There are two categories of web analytics: off-site and on-site web analytics. Off-site web analytics refers to web measurement and analysis regardless of whether you own or maintain a website. It includes the measurement of a website's potential audience (opportunity), share of voice (visibility), and buzz (comments) that is happening on the Internet as a whole. On-site web analytics measures a visitor's journey once on your website. This includes its drivers and conversions; for example, which landing pages encourage people to make a purchase. On-site web analytics measures the performance of your website in a commercial context. This data is typically compared against key performance indicators for performance, and used to improve a web site or marketing campaign's audience response. Historically, web analytics has referred to on-site visitor measurement. However, in recent years this has blurred, mainly because vendors are producing tools that span both categories.
Web Analytics	The measurement of data as it relates to an Internet site, including the behavior of visitors, the amount of traffic, the conversion rates, web server performance, user experience, and other information, in order to understand and prove results and continually improve the results of a site toward a set of objectives.
Web Beacon	A web beacon, also known as a clear.gif, is a transparent graphic image, no larger than 1x1 pixel, usually placed on a web site or in an email to track visitor behavior. SiteCatalyst's web beacon points to a server, known as 2o7.net, to retrieve the image. The image, when loaded into the user's browser, loads JavaScript code that performs several functions, one of which is to check for a cookie. If a cookie is not loaded, it loads one on the browser. If the user does not have cookies enabled, a web beacon will not be able to track the user's activity. The web beacon for a non-cookied user will account for an anonymous visit, but the user's unique information will not be recorded. The JavaScript code also collects variables (particular to the selected web site) from the code and sends them back to SiteCatalyst.
Web Elements	Google product that lets you copy and paste code for displaying Google products, such as Custom Search and Google Calendar, onto your own website. To learn more and see which Google products have a Web Element, see the Web Elements site.
Web server logfile analysis	Web servers record some of their transactions in a logfile. It was soon realized that these logfiles could be read by a program to provide data on the popularity of the website. Thus arose web log analysis software. In the early 1990s, web site statistics consisted primarily of counting the number of client requests (or hits) made to the web server. This was a reasonable method initially, since each web site often consisted of a single HTML file. However, with the introduction of images in HTML, and web sites that spanned multiple HTML files, this count became less useful. The first true commercial Log Analyzer was released by IPRO in 1994[2]. Two units of measure were introduced in the mid 1990s to gauge more accurately the amount of human activity on web servers. These were page views and visits (or sessions). A page view was defined as a request made to the web server for a page, as opposed to a graphic, while a visit was defined as a sequence of requests from a uniquely identified client that expired after a certain amount of inactivity, usually 30 minutes. The page views and visits are still commonly displayed metrics, but are now considered rather rudimentary. The emergence of search engine spiders and robots in the late 1990s, along with web proxies and dynamically assigned IP addresses for large companies and ISPs, made it more difficult to identify unique human visitors to a website. Log analyzers responded by tracking visits by cookies, and by ignoring requests from known spiders. The extensive use of web caches also presented a problem for logfile analysis. If a person revisits a page, the second request will often be retrieved from the browser's cache, and so no request will be received by the web server. This means that the person's path through the site is lost. Caching can be defeated by configuring the web server, but this can result in degraded performance for the visitor to the website.
Web Server	This is a vague term whose meaning must be determined by the context in which it's used. It will mean one of two things: The physical computer that acts as a server. This is a computer just like any other. It is called a server because its main function is to deliver web pages. Often there is nothing particularly special about a server's hardware; it's only a server because of the software.
WebResults	The free monthly newsletter from Webtrends. (Note: You can subscribe to receive it every month.)
Website URL	A Website URL is the complete address to a website. For example, the complete URL to stratigent.com is http://www.capitalsal.com/.
Weekly Unique Visitor	The number of unduplicated (counted only once) visitors to your website over the course of a single week.
What if	A type of analysis that allows an end-user to pose hypothetical situations against their data to model or predict outcomes.
White Papers	Technical documents used primarily to generate leads for business-to-business technology companies. The technical papers typically include industry research, statistics and deep technical information. Download Anvil's SEO White Paper for an example of how it's done correctly.
WIN Partner	The Webtrends Insight Network is a select group of leading interactive agencies, marketing consultants and web analytics experts worldwide that work with customers to maximize the success of their online initiatives through the use of Webtrends solutions.
WML	Website META Language, a free, extensible off-line HTML generation toolkit for UNIX, distributed under the GNU General Public License (GPL v2).
Works With A Variety Of Platforms	Sawmill runs on all major platforms. There are currently pre-built versions for the following platforms: Windows (x86 or x64), Linux (x86 or x64), Solaris (SPARC, x86, or x64), FreeBSD (x86), and OpenBSD (x86). Source code is also available (obfuscated), so Sawmill can be compiled and run on any system with a C++ compiler. If you are interested in seeing Sawmill on any other platforms, send mail to sawmill@flowerfire.com.
World Wide Web	Also called the web, this is a global information space in which people can communicate via computers connected to the Internet. Some people use internet and the web interchangeably, even though the web is a service that operates over the internet.
XML Feeds	A form of paid inclusion where a search engine is fed information about pages via XML, rather than gathering that information through crawling actual pages. Marketers can pay to have their pages included in a spider based search index either annually per URL or on a CPC basis based on an XML document representing each page on the client site. New media types are being introduced into paid inclusion, including graphics, video, audio, and rich media. These feeds are commonly used for Shopping Feeds.
XML	Extensible Markup Language is a World Wide Web Consortium (W3C) recommended general-purpose markup language for creating special-purpose markup languages, capable of describing many different kinds of data.
Yearly Unique Visitor	The number of unduplicated (counted only once) visitors to your website over the course of a single year.
Yellow Dog Linux	Yellow Dog Linux is a variant of Red Hat Linux designed for PowerPC CPU architectures, such as those sold by Apple and IBM.
Your Website’s Greatest Asset – Your Contact Form	A poorly designed and managed contact form on your site can cost you big. Find out how to maximize its value. Improving your Contact Form Performance
YOY	Year over Year is a means of comparing data from one year to the next. For example, to compare online holiday retail revenue from last year to this year.
YSM	Yahoo! Search Marketing
Zero Latency	Latency is a time delay between the moment something is started and the moment one of the effects of that event begins. When there is no time lapse between the event and the effect, it’s called zero latency. In analytics, this term is used to describe instantaneous receipt of data and the ability to analyze and act on that data.
Zero-page Visit	A visit that included no page views. This is possible if a visit consisted of at least one request for a non-page file (such as a graphic) but no page files (such as .htm, .asp, .jsp, or .cfm.)
Zeus	Zeus is a commercial web server software application that competes with Apache, Microsoft IIS, and iPlanet web server software systems.
zip Variable	The zip variable tracks the U.S. zip code in which a visitor is located.

Goals Implementation Plan

2014-02-17T00:00:00+00:00

Once in a while we realize (or think of) goals we wish for. But how do we map them onto the day to day? How do we make sure we don’t forget them (and avoid getting pulled into the daily rat race)? How do we make sure we keep working on the right things and keep making progress?

Define the main goals

Work it out backwards

To help decide what the goals should be, one idea is to work backwards: start with the outcome. Think of the life you want and the setup / work / activities that make you the happiest. Imagine what a typical day would be and extract the goals from there.

Question it

Once set on a goal, re-examine it. Question the goal thoroughly to reach a clearer understanding:

Why spend time on this? Imagine somebody you care about asks why you spend time on this; what would you say?
What do I want to see happening? What do I expect the future to be on this?

A mind map can be useful to lay things out and help think them through, as can writing a text about it.

Example Goal: I want to play guitar.
Why? I enjoy it as a pause from school / work, it makes me feel good and I really like to make my own songs.
What’s the future? Play in a band, do a live concert, record an original song, record a full album.

Evaluation (the stakes)

Imagine you are going to be evaluated at the end of every month (for example). Even better is to actually find a way to commit to something real that will cause pain when not achieved. If you do your own evaluation, try to be harsh and cruel (excuses won’t work).

So define: what are you going to be evaluated on?

Sometimes time spent on a project correlates well with work done, but it is not the best measure.

So also define other concrete goals and accomplishments.

Example: Play Guitar: learn and record a new song every other week. At the end of the month I need to have four songs that are enjoyable and keep getting better. If that does not happen, then it won’t get me any further in my musical career, and I should drop it altogether.

In general, the bigger the pressure (the stakes), the more likely you are to make progress.

Track time down

If time spent is a good metric, then you need to be able to monitor it.

It is probably worth knowing if one of the goals is being under-done and needs a bit more attention, while the others should get less. Example: I haven’t done sport for a month already, so for the next 2 weeks I focus mostly on sport to bring back some balance.

Review constantly

Do frequent reviews (weekly, for example) and plan adjustments (check out the sprint planning of Scrum for ideas on doing this).

I find that constantly looking at the goals keeps them in mind and present.

(I keep a todo list that I review daily, also to avoid forgetting boring stuff I need to do.)

Filter what tasks are important vs not

Learn to filter which tasks are important and which are not. Even if related to the project at hand, not every single task contributes the same amount to the end goal.

A heuristic could be: how much does this contribute to my goals? Compared to the other tasks, is this the one that contributes the most? (Prioritize accordingly.)

Visibility (Marketing)

Making the results of work visible might open up new possibilities, and eventually more resources.

Precepts (Rules of thumb)

Block time, Set timeouts
Quiet time, remove distractions
It’s a balancing act, but avoid jumping frequently between things (no multitasking)
Allocate a maximum amount of time per day / week for pointless tasks
Simple planning, half a day for each task
Avoid meetings
Identifying what is important (and especially what is not) is key !

References:

Scrum methodology
Meta learning post: http://al3xandr3.github.io/meta-learning.html

Game of Life

2014-02-09T00:00:00+00:00

( This is not about Conway’s Game of Life, although there are some overlapping theories here )

If you see life as a game, then maybe you can use strategic thinking to optimize it.

The Goal

Nature’s Purpose

Make no mistake about it, you are a random event of nature whose purpose, like that of any other animal out there, is to continue the survival of the species.

What’s the life purpose of a lion? Of an ant? Of a flower?

So you are somewhat optimized for that. Examples of bugs in the “matrix”:

Sex drive overrules rational thinking. People do extremely stupid things to get laid.
Kids take your time / money / sleep / relationship / fun away, but you still can’t stop yourself from forgiving, loving and wanting them.

The Higher Purpose

Humans have big brains and like to complicate stuff, so it normally happens that we search for a higher meaning in life, for something more. Also because nowadays we live longer and survival is getting easier (mass produced food, governments / organizations that assure basic needs, etc…).

So often we go the route of doing something big, a higher purpose, with the argument that it will be recognized by all of human civilization: scientific achievement, art, books, peer recognition, etc… Some things even influence others long after the author’s death. That is very cool, no doubt, but I think that each of these achievements is a means to an end, and the “end” is in fact just a search for feeling good and happy.

Feeling good and happy is what you have been trying to get to all along. (And of course minimizing the feeling bad / sad moments, the negative happy.)

So, what do you feel good about ?

Although people feel good about many of the same things (like sex, food, entertainment, peer recognition), it also happens that what you feel good about is very much influenced by your upbringing, life experience and expectations. Plus, it changes over your lifetime.

( And this is a good thing in the sense of competition for limited resources: if people are happy about different things, then they don’t need to run over each other for the same thing. Ex: money has this problem now. )

The scientist who makes a theoretical breakthrough, a biologist who discovers a new bug species, owning a Ferrari, owning a car, playing a concert, working in IT, traveling around. These achievements / events might be received as extremely happy, less happy, neutral or even unhappy depending on the person.

So go figure out what things make you feel good and set that as your end goal. And be aware that it might change over time.

The Game

Briefing:

What is the goal (what makes you feel good) ?
Your resources:
- Basic Resources:
  - Time
  - Money
  - Health
- Developed Skills:
  - Professional Job Skills
  - Languages
  - Motivation
  - Being productive
  - Eating healthy (+ exercising)
  - Understanding & Influencing people
  - etc…

Picking the right skills to develop is very important

You can develop further skills to open up new possibilities and maybe achieve the goals to a fuller extent (or reach new goals). But your basic resources are limited, and you will have to trade off some resources to be able to develop a new skill. So pick wisely, similar to investing money.

Learning English opens up many countries you can go to, and much new literature and many information resources you will be able to access. It is in fact the closest thing to a standard language (in the western world).
Computer programming skills can get you a (nice) job almost anywhere in the world.
Being nice and understanding people can help make new friends.

And some new skills might take you no closer to your goals.

Phases of the game

baby
teenage
young adult
parent
old

Over the years, resources availability will naturally vary:

Kids don’t have money.
Parents don’t have time.
Old people don’t have (good) health.

So there’s the need to plan, to optimize experience.

Reference

http://oliveremberton.com/2014/life-is-a-game-this-is-your-strategy-guide/
The Science Behind a Happy Relationship: http://d24w6bsrhbeh9d.cloudfront.net/photo/a09nZ7z_700b_v3.jpg

Understanding People

2014-01-10T00:00:00+00:00

Someone was so rude just now, for no apparent reason, I’m going to be rude back! No, wait, maybe it has nothing to do with me… Was it a misunderstanding? Did I touch a nerve?

These are some notes of my learning on understanding people.

Introverts vs Extroverts

People naturally tend to be one or the other. Each person has a different mix of the two, and there are also extreme cases of extroverts and introverts.

Introverts (when compared to extroverts)

By definition, introversion is directing interests and behavior towards oneself, and less to the exterior. Introverts need quiet, alone time. They tend to think more before speaking.

Extroverts (when compared to introverts)

By definition, extroversion is directing interests and behavior towards others and the external environment. Extroverts need people around. They are better at making fast and confident decisions on a deadline without complete information.

Professional

Selling and marketing are definitely a world where extroversion is a key advantage.

Cultures are different

The US is known to be a center of extroverts.
Finland is known to be a center of introverts.
HBS seems to be a school of extroverts (that incentivizes / teaches extroversion).

Danger of extroverts

Extroverts, naturally, will get their way more often (when compared to introverts). Extroverts speak up first and more loudly in meetings.

Start with the premise that extroverts and introverts have the same likelihood of being right (or wrong).

This means that bad ideas / solutions will win and prevail more often than they should, simply because they were argued by extroverts.

Reference

Book “Quiet: The Power of Introverts in a World That Can’t Stop Talking”, by Susan Cain

Categorizing people in 4 areas

Take this with a grain of salt, it is a strong simplification / model. Models are always wrong to a certain degree, but can be useful if applied to the right situation.

Every person always has a bit of each category, but some people will lean more heavily on a particular one, or even on 2.

red: think angry, ready for an argument, writes in CAPS LOCK, power, control, approach you very directly
yellow: think sun, happy, enthusiast, positive, friendly, extroverted
green: think grounded, calm, caring, democratic, desire for understanding, supporter, helper
blue: think blueprint, numbers, facts, very analytical, all about the numbers, not emotional, introverted.

reference

Insights Discovery survey product

Body Language

There are common things that our body does naturally when happy, sad, scared, etc… But everybody’s default might be slightly different, so it is very important first to understand the normal baseline, and then look for changes to that baseline; those will be the telltale signs.

Eyes

Opening up: happy, welcoming, wanting to see more.
Closing (squinting): hiding something, not liking someone, lying.
Not looking: avoiding, finding something very unpleasant.

Arms

Crossed arms signals defense, not being comfortable, not liking something, position of strength.

Lips

press lips = protection, discomfort

Head

Massaging the head is used to calm down, and is often a signal of discomfort (calming down from discomfort).

Scratching the head = discomfort
Rubbing the back of the neck = discomfort

Music Production

2013-12-01T00:00:00+00:00

Sound

Sound is a mechanical wave that is an oscillation of pressure transmitted through some medium (like air or water), composed of frequencies which are within the range of hearing. (Wikipedia)

Touching still water creates (mechanical) waves; sound is the equivalent in air. Sound is about agitating still air, generating a (mechanical) wave that propagates through the air and, when it reaches our ears, is perceived as sound.

A Speaker works by pushing air with its cone at a specific vibration.

Properties of sound:

Propagation, is the sound moving around from point A to point B. It has a certain speed of propagation and is also connected to our sense of space, because of surface reflections (reverb, delay).
Amplitude, how high (and/or low) the peaks of the sound wave curve are. The larger the amplitude, the louder the sound is perceived. Interestingly enough (and counter intuitively), sound in air is in fact a longitudinal pressure wave: the air moves back and forth along the direction of travel rather than up and down. In practice sound is drawn as a wave curve for easier manipulation.
Frequency, how close together the peaks in the sound wave are: the closer, the higher the frequency. Frequencies translate directly into music notes. Every instrument and microphone has a set frequency range it can produce (or capture). Every note on an instrument corresponds to a certain frequency, ex: A4 = 440 Hz.
Timbre, relates to the type of sound. A sine wave is about the simplest sound wave possible; real life sounds are actually an aggregation of several waves together, and that aggregation defines the timbre. (Could I use an FFT to break a timbre into its component parts?)
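On the FFT question: a Fourier transform does break a sound into its sine wave components. Below is a minimal sketch, assuming a made-up "timbre" built from a 110 Hz fundamental plus two weaker harmonics. The sample rate, the length and the helper names (`make_tone`, `dft_amplitudes`) are my own choices for illustration, and it uses a slow, naive DFT instead of a real FFT library to stay dependency free.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)
N = 800             # 0.1 s of audio, so each DFT bin is 10 Hz wide


def make_tone(partials):
    """Sum sine waves given as (frequency_hz, amplitude) pairs."""
    return [
        sum(a * math.sin(2 * math.pi * f * n / SAMPLE_RATE) for f, a in partials)
        for n in range(N)
    ]


def dft_amplitudes(samples):
    """Naive O(N^2) discrete Fourier transform; returns one amplitude per bin."""
    n_samples = len(samples)
    amps = []
    for k in range(n_samples // 2):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * n / n_samples)
            for n, x in enumerate(samples)
        )
        amps.append(2 * abs(acc) / n_samples)
    return amps


# Hypothetical timbre: fundamental at 110 Hz (A2) plus two weaker harmonics.
tone = make_tone([(110, 1.0), (220, 0.5), (330, 0.25)])
amps = dft_amplitudes(tone)
bin_width = SAMPLE_RATE / N

# Keep only the bins that carry a noticeable amount of energy.
peaks = [(round(k * bin_width), round(a, 2)) for k, a in enumerate(amps) if a > 0.1]
print(peaks)
```

The peaks that come out are the same three (frequency, amplitude) pairs that went in, which is the "timbre recipe" of this fake instrument. A real recording would be messier, and a proper FFT implementation would be much faster.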

Frequency of a guitar: the open strings are E2=82.41 Hz, A2=110 Hz, D3=146.8 Hz, G3=196 Hz, B3=246.9 Hz, E4=329.6 Hz

Drop D tuning: D2=73.42 Hz

and the highest note on the guitar, the 24th fret on the high E string = 1318.5 Hz
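These numbers can be reproduced with the standard equal temperament formula, where each semitone (each fret) multiplies the frequency by the 12th root of 2. A small sketch, assuming A4 = 440 Hz tuning and MIDI note numbers (A4 = 69) as a convenient way to count semitones; the function names are just for illustration.

```python
A4_HZ = 440.0   # reference pitch (assumed standard tuning)
A4_MIDI = 69    # MIDI note number of A4


def note_frequency(midi_note):
    """Equal temperament: every semitone is a factor of 2**(1/12)."""
    return A4_HZ * 2 ** ((midi_note - A4_MIDI) / 12)


def fretted_frequency(open_string_midi, fret):
    """Each fret raises the open string by one semitone."""
    return note_frequency(open_string_midi + fret)


# Open strings of a guitar in standard tuning, as MIDI note numbers.
OPEN_STRINGS = {"E2": 40, "A2": 45, "D3": 50, "G3": 55, "B3": 59, "E4": 64}

for name, midi in OPEN_STRINGS.items():
    print(f"{name}: {note_frequency(midi):.2f} Hz")

print(f"Drop D (D2): {note_frequency(38):.2f} Hz")
print(f"High E string, 24th fret: {fretted_frequency(64, 24):.1f} Hz")
```

The 24th fret is two octaves above the open string, so it is simply 4 times the open string frequency (about 1318.5 Hz for the high E).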

Signal level & DI

The term dB (decibel) by itself means the amount a signal level changes in relation to wherever it started. When you see gear specs that say “-10 dBV” or “+4 dBu”, they are telling you how much lower or higher the average output is relative to a specific fixed reference voltage. That voltage is usually either 1.0 V, referred to as “0 dBV”, or 0.775 V, referred to as “0 dBu”.
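To make the reference voltages concrete, here is a small sketch that converts a level in dB to volts and back, using the usual 20 * log10 rule for voltages. The function names are my own; the two references are 1.0 V for dBV and about 0.775 V (the square root of 0.6) for dBu.

```python
import math

DBV_REF_VOLTS = 1.0             # 0 dBV
DBU_REF_VOLTS = math.sqrt(0.6)  # 0 dBu, about 0.775 V


def db_to_volts(level_db, ref_volts):
    """Voltage that corresponds to a level in dB relative to ref_volts."""
    return ref_volts * 10 ** (level_db / 20)


def volts_to_db(volts, ref_volts):
    """Level in dB of a voltage relative to ref_volts."""
    return 20 * math.log10(volts / ref_volts)


pro = db_to_volts(4, DBU_REF_VOLTS)         # professional line level, +4 dBu
consumer = db_to_volts(-10, DBV_REF_VOLTS)  # consumer line level, -10 dBV

print(f"+4 dBu  = {pro:.3f} V")
print(f"-10 dBV = {consumer:.3f} V")
print(f"difference = {volts_to_db(pro, consumer):.1f} dB")
```

So the two line levels are not 14 dB apart, as the labels might suggest, but a bit less than 12 dB, because they use different reference voltages.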

Common types:

Line level (processors, racks, pedal effects?), 2 kinds:

professional: +4 dBu
consumer audio: -10 dBV

Instrument level: -20 dBu

Microphone level: -30 dBu

Guitar signal is high impedance, instrument level (-20 dBu) and unbalanced (degrades faster over long cables).

A DI box converts the guitar signal (-20 dBu) to microphone level (-30 dBu), and also converts it into a balanced signal (better over long distances).

Reference: http://www.ovnilab.com/articles/linelevel.shtml

Microphone

Types:

Dynamic: rugged, for loud live stage usage, has a fairly small pickup area, typically directional. Ex: Shure SM57, SM58
Condenser: very sensitive, not for live stage, better for studios, pickup a

A microphone’s polar pattern describes the area around the microphone that it picks up. Typically dynamic microphones pick up a smaller area than condensers; the SM58 has a cardioid (heart shaped) polar pattern. Condensers are often less directional, and more likely to have an omnidirectional or wider polar pattern.

A wider polar pattern captures more of the room and environment, which can be very useful depending on the application.

Noise

Many electronic devices have a transformer or power supply that works at a specific frequency; when that frequency is within the human hearing range (20 Hz to 20 kHz) it can create noise in recordings. Room light controllers (that allow dimming) are noisy.

DAW

Project Checklist

Proper project tracks and location
Digital Audio preferences and Hardware Setup (see Basic Settings)
File recording: all uncompressed

Audio Settings

48,000 Hz, 24 bit, 128 samples (less if possible) buffer size.
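To see why a small buffer matters, the delay added by one buffer is just its size divided by the sample rate. A quick sketch with hypothetical helper names; it only counts a single buffer, and the real round trip through the audio interface will be higher than this.

```python
def buffer_latency_ms(buffer_size, sample_rate_hz):
    """Time it takes to fill one audio buffer, in milliseconds."""
    return 1000.0 * buffer_size / sample_rate_hz


def uncompressed_bytes_per_minute(sample_rate_hz, bit_depth, channels=1):
    """Disk space used by one minute of uncompressed audio."""
    return sample_rate_hz * (bit_depth // 8) * channels * 60


for size in (64, 128, 256, 512):
    print(f"{size:4d} samples -> {buffer_latency_ms(size, 48000):.2f} ms")

print(uncompressed_bytes_per_minute(48000, 24), "bytes per minute (mono)")
```

At 48,000 Hz a 128 sample buffer adds under 3 ms, which is hard to notice while playing; at 512 samples it is already above 10 ms.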

Individual tracking checklist

Name individual tracks, the recordings are often named after the track
Mono vs stereo
Set the level

When too many plug-ins (CPU having a hard time)

Render the audio into a 2nd track.
Disable the original track with effects - but don’t delete, to be able to re-amp (fine tune) the FX’s again later.

Saving processing power on doubled guitars (left and right)

Doubled guitars can be panned left and right into a bus, that enters into a stereo guitar amplifier. Individual sub-tracks are without any effects. Saves processing power.

Several Takes, merge the best bits (audio photoshop)

Do several takes for the same track (like rhythm guitar), then take the best from each and merge them into 1:

Do very short crossfades between each segment to avoid clicks at the joins. (Reaper seems to do that by default.)
Do a track freeze (or merge), that renders it all into 1 audio track.
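What a crossfade does at the join can be sketched in a few lines: over a short overlap, the first take fades out while the second fades in. This is a simple linear crossfade on plain lists of samples, with names of my own choosing; a DAW will likely use smoother fade curves.

```python
def crossfade(first, second, overlap):
    """Join two takes, blending the last `overlap` samples of `first`
    with the first `overlap` samples of `second` (linear fade)."""
    if overlap <= 0 or overlap > min(len(first), len(second)):
        raise ValueError("overlap must fit inside both takes")
    start = len(first) - overlap
    blended = []
    for i in range(overlap):
        t = (i + 1) / (overlap + 1)  # fade position, strictly between 0 and 1
        blended.append((1 - t) * first[start + i] + t * second[i])
    return first[:start] + blended + second[overlap:]


# Two "takes" that would jump from +1 to -1 if simply glued together.
take_a = [1.0] * 6
take_b = [-1.0] * 6
print(crossfade(take_a, take_b, overlap=3))
```

Instead of a hard jump (heard as a click), the signal now ramps from one take to the other over the overlap.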

Track Automations

It is very common that the volume of a track should not be the same for the whole track; do track automations to change the volume over time. This is very useful to make things stand out / hide when required.

MIDI

MIDI often triggers either a sampled instrument or an algorithmic synth one.

Velocity is the intensity with which the note was played; lower it to make the note quieter.

Quantization is a good trick to fix a performance: snap the notes that were played to a grid of quarter notes, eighth notes, etc…

Quantization strength: do it at 20%, check if good enough, if not then 20% again, etc… This keeps the human feel.
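The strength idea can be sketched as moving each note only part of the way towards its nearest grid position. A minimal example, assuming note start times measured in beats and a hypothetical `quantize` function (not any particular DAW's API):

```python
def quantize(times, grid, strength=1.0):
    """Move each note start towards the nearest grid line.

    strength=1.0 snaps fully to the grid, strength=0.2 moves 20% of the way.
    """
    result = []
    for t in times:
        target = round(t / grid) * grid
        result.append(t + strength * (target - t))
    return result


# Note starts in beats, played a bit off a quarter note grid (grid = 1 beat).
played = [0.10, 1.05, 1.90]

once = quantize(played, grid=1.0, strength=0.2)
twice = quantize(once, grid=1.0, strength=0.2)
print([round(t, 3) for t in once])
print([round(t, 3) for t in twice])
```

Each 20% pass shrinks the timing errors a bit more while keeping their relative feel, so you can stop as soon as it sounds tight enough.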

Equalizer

In a mix, each instrument needs a region of the frequency spectrum of its own to be heard properly. The EQ strategy is to think about which frequencies each instrument dominates and accentuate that. For example, for a bass: cut the high end (also because the bass does not emit those frequencies, so what is there is most likely noise) and slightly boost the low frequencies. Guitar typically lives in higher frequencies. Vocals, guitars and piano live in similar frequencies. Cut radically to give space to all instruments at the same time, and decide the cuts depending on each instrument’s function in the music: is the piano dominating the top and the guitar the mid, or vice versa?

Better to cut than to boost. Boosting a lot sounds fake.

Types Of EQ:

Low / high pass - only lets the low or the high frequencies pass (high pass is often used to remove noise from mic’d instruments)
Low / high shelf - reduces (or boosts) the low or high end to a set level (typically the bass and treble knobs of amps)
Band pass (bell) filter - a bell shaped window to boost or cut the frequencies in that window. Not as commonly used as the other filters. A great way to find what frequencies a certain instrument dominates is to use a bell filter, move it manually across the whole frequency range and hear which frequencies make it weaker.

The EQ bands in a real mixing console are typically:

High pass (cut lower than 75 Hz, removes noise)
Low shelf (80 Hz)
High shelf (12 kHz)
Bell low mid (sweep from 100 Hz - 2 kHz)
Bell high mid (sweep from 400 Hz - 8 kHz)

EQ reference guide: http://www.idmforums.com/showthread.php?t=18237

Dynamics

Effects that act on volume; very important. Maybe a bit abused nowadays, where there is little dynamic range left and volumes are maximized.

Types of Dynamic Effects:

Compressor - reduces the level of the loud peaks (the transients), which removes dynamic range. 1st bring down the transients, 2nd raise the gain back up. In the end this means increased volume of the quiet moments.
Limiter - originally meant to cut the sound above a specific threshold. Works like a compressor, but automatically increases gain. It is essentially a compressor with a high ratio, 10:1 or more. Loudness maximizers are limiters.
Gate - works like a compressor turned upside down: with a very high ratio and a very fast attack, it cuts the signal when it falls below the threshold. It’s used to cut noise below the threshold.

Parameters:

Threshold - the level is reduced when the volume goes above the threshold.
Knee - blunts the threshold hit points, softening the transition.
Ratio - input to output compression ratio: 2:1 means that, above the threshold, a 2 dB increase at the input gives only a 1 dB increase at the output. Defines how much the volume comes down when it passes the threshold.
Attack and Release - control how fast the compressor/limiter kicks in (attack) and lets go again (release).

Compressor application - rides the wave: lowers the highs, raises the quiets. Imagine following the voice track with a hand on the volume knob (actually very common): when it gets quieter raise it up, when it gets very loud lower the volume a bit. A compressor does this automatically. It is an analyzer that, by taking averages of x samples, creates a simplified wave shape called the envelope, which is used to drive the volume knob. In a compressor, when the envelope curve goes up the volume goes down.
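The two halves of that description can be sketched separately: an envelope made by averaging the last x samples, and the gain rule that uses threshold and ratio. This is a simplified illustration with my own function names and default values; it ignores knee, attack and release, so it is not how any specific compressor plug-in is implemented.

```python
def envelope(samples, window):
    """Average of the absolute values of the last `window` samples."""
    out = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1): i + 1]
        out.append(sum(abs(s) for s in chunk) / len(chunk))
    return out


def compress_db(level_db, threshold_db=-20.0, ratio=4.0, makeup_db=0.0):
    """Static compressor curve: above the threshold, the part of the level
    that exceeds it is divided by the ratio; makeup gain is added after."""
    if level_db > threshold_db:
        level_db = threshold_db + (level_db - threshold_db) / ratio
    return level_db + makeup_db


print(envelope([1.0, -1.0, 0.0, 0.0], window=2))

loud, quiet = -8.0, -30.0
print(compress_db(loud), compress_db(quiet))                # peaks come down
print(compress_db(loud, makeup_db=6.0),
      compress_db(quiet, makeup_db=6.0))                    # then gain goes up
```

With these example settings the loud moment drops from -8 dB to -17 dB while the quiet one is untouched, so the gap between them shrinks from 22 dB to 13 dB. Adding makeup gain then lifts everything, which is why the quiet moments end up louder.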

Transients are the peaks in the sound wave.

You should always mix / play at the same volume level, because volume influences our ears’ natural EQ.

Delay & Reverb

A short slapback delay can often do the same job, and is cleaner than a more complex, noisy reverb. A short slapback simulates a single reflection. A reverb is a much more complex (and noisy) set of slapbacks.

Guitar Tip: Try a slapback (left: 80 ms, right: 90 ms, with left-right crossover and wet at 30%) and see if it is good enough to replace the reverb, as a cleaner and tighter alternative.
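A slapback is simple enough to sketch directly: the output is the dry signal plus one delayed, quieter copy of it. The example below is a mono sketch with assumed names and a 48,000 Hz default sample rate; the stereo version from the tip would run it twice (80 ms and 90 ms), and the left-right crossover is left out here.

```python
def slapback(dry, delay_ms, wet=0.3, sample_rate=48000):
    """Mix the dry signal with a single delayed copy (one reflection)."""
    delay_samples = round(sample_rate * delay_ms / 1000)
    out = []
    for i in range(len(dry) + delay_samples):
        direct = dry[i] if i < len(dry) else 0.0
        j = i - delay_samples
        echo = dry[j] if 0 <= j < len(dry) else 0.0
        out.append(direct + wet * echo)
    return out


# A single click, processed at a low sample rate so the result is easy to read.
click = [1.0] + [0.0] * 9
left = slapback(click, delay_ms=80, wet=0.3, sample_rate=1000)
right = slapback(click, delay_ms=90, wet=0.3, sample_rate=1000)
print(left.index(0.3), right.index(0.3))
```

The click shows up once at full level and once more, at 30% level, 80 samples (left) or 90 samples (right) later.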

Use reverb to give the impression of a room, of a certain space the instruments are in; thus it is often a good idea to apply one reverb to the whole mix. When reverb is applied to individual tracks, it is as if they were in different rooms or at different distances.

Chorus, phaser, flanger are oscillating short delay type effects.

A long delay often has its repetitions modified, for example by cutting some high frequencies to make it clearer that they are reflections, which sounds more natural.

A long delay is better if time synced with the track.

A long delay is better when the left is different from the right, for a ping pong effect.

The best (most realistic) reverbs are convolution reverbs, but they can’t be fine tuned as much as the algorithmic ones. With the algorithmic ones, the goal is typically to best simulate a space.

Getting the right sounding reverb often involves trial and error.

Space (depth) in a stereo mix

Properties that affect the perception of depth in a mix:

Volume - the closer, the louder
High end - over long distances sound loses high end and becomes duller
Reverb - the closer the mic, the less it gets from the room, so the less reverb
Stereo width - a sound heard in only 1 ear is very close; as it gets further away you start to hear it with a small delay in the other ear; when very far there is no separation of left from right.

The main focal point should be at the center, thus widening a certain track or instrument can give space for another central point. For example guitar rhythm on stereo wide and guitar solo centered. The center can then be filled with instruments across all frequencies, typically: low kick, bass, guitar, vocals and hi-hat

Important is that left and right are in balance.

Triangle shape balance strategy: low end is central and the high end can be spread wide.

Guitar Tip Having left (or right) channel delayed up to 20ms, gives a close impression of doubled guitar. Somewhat similar to recording 2 takes of same lick, panned left and right.

Synths

The purpose of a synth is to engineer an instrument’s sound: a model of the instrument’s sound. Synths have a nice modular design, with modules that are combined together to recreate an instrument’s timbre.

A synth works by first generating some kind of simple wave, then applying several filters to it, including time based ones (like oscillating ones) that, for example, create vibrato.

Building Blocks:

Oscillator - simple sound generator: sine wave, square wave, etc
Filter - low pass, high pass, band pass
Amplifier - volume knob, really works together with the envelope
Envelope - the parameters that define the volume when note on and note off happen.
Low-frequency oscillator (LFO) - cyclic oscillator that is used to dynamically change another module’s parameter.

Example: Tremolo is a triangular LFO changing the volume over time. A compressor is like a dynamic envelope that averages the volume in real time.
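The tremolo example maps almost one to one to code: an LFO module produces a slow triangle wave, and the amplifier module uses it to set the gain of each sample. A rough sketch, where the rate, the depth and the function names are illustrative assumptions:

```python
def triangle_lfo(t, rate_hz):
    """Triangle wave between 0 and 1 at time t (in seconds)."""
    phase = (t * rate_hz) % 1.0
    return 1.0 - abs(2.0 * phase - 1.0)


def tremolo(samples, sample_rate, rate_hz=5.0, depth=0.5):
    """Let the LFO pull the volume down by up to `depth` and back, cyclically."""
    out = []
    for n, x in enumerate(samples):
        gain = 1.0 - depth * triangle_lfo(n / sample_rate, rate_hz)
        out.append(x * gain)
    return out


# A constant signal makes the gain changes directly visible.
# The tiny sample rate is only there to keep the output short.
steady = [1.0] * 8
print([round(s, 3) for s in tremolo(steady, sample_rate=20, rate_hz=5.0)])
```

Pointing the same LFO at the oscillator’s pitch instead of the amplifier’s gain would give vibrato rather than tremolo.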

Guitars

Woods

Often mentioned as a good combination of woods for an electric guitar:

Light body: Basswood back, maple top
Dense wood neck

But there is a good deal of taste, and even the same wood can be of different quality.

Reference

This is mostly a summary of things I’ve learned from the music production class on Coursera, with added findings from my own guitar recordings.

Favorite Talks

2013-07-26T00:00:00+00:00

Computers can/should augment human abilities manyfold

Inventing on Principle, Bret Victor, 2012: http://vimeo.com/36579366 A big vision makes you think out of the box and thus make more progress (same as Steve Jobs?). This talk (and its demos) inspired several new software projects that try to remove the disconnect between writing code and running it, leading to new, very dynamic, interactive development environments.
Stop Drawing Dead Fish, Bret Victor, 2012: http://vimeo.com/64895205 For the most part, we use computers in the same way as paper (build static media: pictures) or TV (play static media: movies), but we can now leverage computers/devices to take advantage of user input (touch screens) at runtime to improve manyfold the way we create media. Computer code is (mostly) static and derived from language (not visual).
Media for Thinking the Unthinkable, Bret Victor, 2013: http://vimeo.com/67076984 Written language and math notation revolutionized human progress because they introduced tools to think / manipulate thinking in new ways and reach the (previously) unthinkable. But they were invented in a paper age; computers are the new paper and they provide a much richer medium. Thus computers should allow for more expressive and interactive “notation” tools, and Bret is looking for them.
The Future of Programming, Bret Victor, 2013: http://vimeo.com/71278954 The 60’s and 70’s were a fertile period for the field of computer programming: the computer was new and hot and a lot of brilliant ideas were being tried out. But somehow many of those ideas got forgotten over time and in many ways we took a step back. There is a natural resistance to change, and that is a big part of the reason, but more importantly the newest generations were never exposed to those ideas, so there is the risk that many things will be lost and will need to be rediscovered.
Sonic Pi: Teaching computer science with music. http://www.youtube.com/watch?v=KYO9N4kDK_o Computers should be used in a cyborg way, to extend human powers; that’s why kids should learn to program. Sam has developed a musical system on the Raspberry Pi that allows kids to produce music by using simple programming. The talk includes the Clojure and Ruby programming languages.
Alan Kay - The Future Doesn’t Have to Be Incremental: https://www.youtube.com/watch?v=gTAghAJcO1o But anything by Alan Kay is good !

Improving life with numbers

Knowledge Tracking - http://quantifiedself.com/2011/11/roger-craig-on-knowledge-tracking/
Quantifying Seat Time - http://quantifiedself.com/2013/07/mark-leavitt-on-tracking-and-hacking-sitting/
Spaced Repetition and Anki - http://quantifiedself.com/2013/06/roger-craig-on-spaced-repetition-and-anki/

Data Analysis

Beyond Freakonomics: New Musings on the Economics of Everyday Life - https://www.youtube.com/watch?v=Tmo9YsNXWCc Steven Levitt

Others

Inheritance - Radiolab podcast
How doctors want to die - Radiolab podcast
Emergence - Radiolab podcast
Peak Prosperity accelerated crash course: https://www.youtube.com/watch?v=pYyugz5wcrI. We currently live in a non-scalable manner: oil is a finite resource but we are fully dependent on it. World population is growing exponentially, and we are bound to hit a breaking point where we won’t be able to get more oil. We should start finding alternatives, to be less dependent on it. Interesting: really nice explanations of exponential growth and curve fitting.

Product and Application Design

2013-07-25T00:00:00+00:00

Product

#1 Give something back to users

Computers are made to make people’s lives easier, so do something that makes life easier/better for the end user.

Dogfood: It should solve my own problem. It is very important that I, the inventor, need it too and that I am an intensive user.

Reference:

http://paulgraham.com/startupideas.html

Application Design

Always consider the end user point of view

How would my main target audience use it? Does the UI make sense and is it obvious? Could things be simplified further?

UI: Easy and simple to use

The UI of a tool determines / limits its usage (success in adoption, incentivizing a certain type of design, etc…). Ex:

GPS on the phone: maybe people are not aware they should download maps beforehand, thus making the typical experience fairly unsuccessful and adoption low.
Euro money: nations had a money interface change. Whenever a nation goes from a money of 100 to 1, then 1 looks small and not much -> incentivizes spending (? just a theory).

Requirement: it must be easy and get out of the way; the user does not need to know internal technology details. Do roleplay: imagine how a user who does not know the technology details would use it.

Good example: the Google search box, the most useful and simple (to use) application ever?

If the app has many commands, make them searchable (example: the Light Table or Sublime Text find command). Use a search box instead of endless nested menus, so that users only need to have a vague idea of how a command was named; don’t make them learn endless nested menus or shortcuts.

UI is a key thing and needs constant curation and re-thinking. It should enable a good workflow and be a multiplier of the program’s core abilities; UI is not just a way for the user to reach those core abilities. Think of the Ableton Live UI, which even breaks with the original paradigm of music production and mixing when needed, and takes advantage of the computer’s abilities to make for a better workflow.

Program Architecture

The backend is an independent program exposed as a service (by an API endpoint); an application is then built by using several independent services (principle of composability). It follows the server <-> client model. (Example: SuperCollider, where end users can use whatever language they favor to communicate with the server over a very simple protocol.)
Do simple APIs (REST like?)
The services are built independently and their implementation details are hidden (each doing 1 thing well).
This allows replacing backends while keeping the same API interfaces.
It allows re-using the API services for different applications, on different UIs for example, and lets them be re-used by others.
The interface is in JavaScript so that it can run on cloud, desktop and mobile devices.
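The "replace the backend, keep the API" idea can be sketched in a single process, without any networking. Everything below is hypothetical and for illustration only: `NoteStore` stands in for the API contract, `DictStore` and `LogStore` are two interchangeable backends, and `NotesApp` is the application that only ever talks to the contract.

```python
from typing import Optional, Protocol


class NoteStore(Protocol):
    """The API contract: the only thing the application knows about."""

    def save(self, title: str, body: str) -> None: ...

    def load(self, title: str) -> Optional[str]: ...


class DictStore:
    """Backend 1: keeps notes in a dictionary."""

    def __init__(self) -> None:
        self._notes = {}

    def save(self, title: str, body: str) -> None:
        self._notes[title] = body

    def load(self, title: str) -> Optional[str]:
        return self._notes.get(title)


class LogStore:
    """Backend 2: append-only log; the latest entry for a title wins."""

    def __init__(self) -> None:
        self._log = []

    def save(self, title: str, body: str) -> None:
        self._log.append((title, body))

    def load(self, title: str) -> Optional[str]:
        for saved_title, body in reversed(self._log):
            if saved_title == title:
                return body
        return None


class NotesApp:
    """Application built on the service; it never sees backend details."""

    def __init__(self, store: NoteStore) -> None:
        self._store = store

    def write(self, title: str, body: str) -> None:
        self._store.save(title, body.strip())

    def read(self, title: str) -> str:
        return self._store.load(title) or "(no such note)"


# The same application code runs unchanged on either backend.
for backend in (DictStore(), LogStore()):
    app = NotesApp(backend)
    app.write("todo", "  review goals  ")
    app.write("todo", "review goals weekly")
    print(type(backend).__name__, "->", app.read("todo"))
```

In a real setup the contract would be an HTTP (REST like) endpoint and the backends separate programs, but the design choice is the same: the application depends on the interface, never on the implementation behind it.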

API product tutorials

An API style of product is a product made to be re-used (as all should be), and that requires proper reference information on how to use it:

Online documentation, maybe a getting started quick doc.
Ready made examples
Online video training, setup, examples, walk-throughs.

Community

Keeping in touch with the product community and maintaining an active community is key.

Forum, to communicate bidirectionally with users and solve common FAQs.
Maybe a dedicated FAQ section.
Regular news means a lively, active (& trustworthy) product.
YouTube videos, with demos, news, walk-throughs.
Social media, keep a page and feed on: FB, TW, etc…

Branding

The visuals of how the product is exposed to end users: logo, brand design, colors, styles. The approach should always be consistent, recognizable and associated with the product. Helps to make a stronger, more consistent product.

New vs Old

There’s a balancing act between old and new products. Not all new products are good, and most of them will be unsuccessful (how to spot the good new ones early?). But the future lies in disrupting the old and creating newer, more efficient ways or, as some say, in progressively re-inventing the old. (Also, is there anything really new? I would say it is more a progressive evolution, as in the theory of natural evolution.)

At the same time, some old things were so advanced for their time that they were not adopted immediately and got forgotten (Lisp, Smalltalk). There are excellent gems from the past that are worth re-using in the future.

I would imagine that leveraging new technological breakthroughs to re-invent more efficient ways of doing the old is the way to go. Ex:

Skype, replacing old telecoms with cheap phone calls over the internet.

References

Jeff Bezos (Amazon) 25 quotes

Data Visualization and Presentation techniques

2013-06-20T00:00:00+00:00

Visualization and presentation techniques to better communicate the meaning of the results.

It is about telling stories with data.

Summarizing / compacting data into an easily digestible format.

Dashboards & Reports Goal

To be really actionable they need to have:

Data
Insights - What’s the interpretation / insight from the data
Recommendations for Action. - What to do next
Business Impact. - What happens if we-don’t-act / we-act

They should tie to the outcomes of the business. (Increase revenue, Increase customer loyalty, Reduce costs, Increase user count, etc…)

Reference:

http://www.kaushik.net/avinash/digital-dashboards-strategic-tactical-best-practices-tips-examples/#comment-682950

Reports

The report should be optimized for the problem at hand and updated if needed during the project to highlight problems and new findings as the project develops (agile). There is no single report view that is ideal for every kind of data analysis challenge.

Show the minimum possible to “report” the situation; remove what is not absolutely needed.
The 1st thing to show, in the top left corner, is the key info, normally a summary table: metric (or KPI), last day, prev. day, % change, trend. Others if needed.
- then to the right of that, or below, or in other tabs, it should be possible to drill down into further details.
Red, green (and yellow) typically translate into bad/good/so-so; avoid using them to represent information that is none of those. Go instead for neutral colors, like gray.
- Do use red and green when presenting results like day change and month change: red if lower, green if higher.
Diagrams:
top to bottom (like flow chart).
use rectangular arrows instead of direct arrows.
be consistent, same color and same shapes for the same purpose, don’t mix, consistency helps with message.
Rates, calculated values, total values, etc… should be colored (and shaped) differently.

Reports Data Validation: Whenever an average value is presented and analyzed, it should come with a measure of spread or confidence (the standard deviation, for example); also check the data for normality.
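
As a minimal sketch of reporting an average together with a confidence measure, the snippet below uses Python's standard `statistics` module. The `daily_values` numbers and the `summarize` helper are made up for illustration; a real report would use its own metric and may prefer a proper confidence interval.

```python
import statistics

# Hypothetical daily values of some metric (made-up numbers).
daily_values = [10, 12, 9, 11, 13]

def summarize(values):
    """Return the mean together with two simple measures of spread."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)          # sample standard deviation
    std_error = stdev / len(values) ** 0.5    # standard error of the mean
    return mean, stdev, std_error

mean, stdev, std_error = summarize(daily_values)
print(f"average = {mean:.2f} +/- {stdev:.2f} (std error {std_error:.2f})")
```

A large standard deviation relative to the mean is a hint that the average alone is not telling the whole story.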

Report Inspection View

Start with 1. Summary top view then allow for 2. Drill-Down view

Often the important thing is the final rate of something, but when that rate drops massively it requires investigation to understand why. So, in parallel with the view (chart, table) of the total rate, it is probably a good idea to also have a view of the several points that participate in that total flow: the inspection view. Once the inspection view is set up, it is very quick to see which points changed and influenced the final rate. In web analytics the inspection view is, for example, a view of the flow of pages that lead to a conversion, with every point of that flow measured in a report view.

Recurring vs Ad-Hoc Reports

Recurring reports are fully automated, with no need to touch them. The automation part is often tricky and imposes many limitations on the final report (visualizations, extra calculations, etc…). Recurring reports are often simpler than ad-hoc ones. Ad-hoc reports are one-time only and involve quite a bit of manual data calculation, but are the most flexible.

A somewhat useful mixed solution is to do ad-hoc reports that are almost fully automated. Ex: just copy-paste a table of data into one Excel sheet and the rest of the report updates by itself.

An Ad-Hoc Analysis Task Checklist

Before
- What is the Macro of the activity? ex: a payment flow
  - What is the goal of this analysis and how does it relate to the Macro? ex: how users pay
- Understand how the tracking works and how data is collected: beware of data bias, duplicated data, sessions, unique visits, etc…
  - If it is a page flow, it is sometimes useful to visually lay out the screenshots and see how the tracking and page sequence happen.
- Define and agree with stakeholders what the output is & communicate the constraints / limitations / impediments.

Macro - Identifying the Macro purpose of the data collected is key. What environment is the data from, how was it collected, and if it is from a web page, what does the page do, what’s its purpose? etc…

During
- Final output as short and simple as possible,
- Clear clear clear (include screenshots if needed),
- Target an audience without any knowledge of the matter
- Don’t report everything, just the conclusion. Optimally just 1 chart / diagram or 1 table. Can have different levels of drill-down
- Hide the unhelpful data, think from the final viewer’s pov. Remove the None’s / empty values…
- The end audience doesn’t care about the amount of work, only whether it addresses their questions well - no point in putting everything you found in the output.
- Simple correct English
- Simple graphs
- Put it into perspective / compare with similar / total
- Point out: trends, outliers, counter-intuitive facts
After
- Is the output answering the original question properly? Is it really contributing to something ?
- What have we learned?
- Have an opinion / recommendation - having an opinion means the numbers are understood well enough, and you can see what should be next steps.
- Beware of a short term data view, zoom out to get the whole picture, example: a week’s worth of data might not be telling the whole story.
- Work out ways to validate / test numbers - looking at the same data from different points of view often reveals gaps
- Predict what’s going to happen - from deep understanding

How to Lie with statistics

And, more importantly, how to identify when we’re being lied to with statistics. Take this with a grain of salt: these should essentially be looked at as things to avoid doing.

A too precise average statistic is suspicious, normally, there’s some decimal points.
An “average” can be a mean, a mode or a median; check which one is being reported.
Testing: guarantee an adequate sample size. To lie, just use a small sample size and try many times; eventually one group will show a good result.
Charts without axis labels and numbering are meaningless.
Common sense is still required while reading numbers. But it is better to assume the reader won’t really know.
The average of rates is not the same as the rate of averages; avoid the 1st.
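
To see why the average of rates can mislead, here is a small sketch with made-up numbers (the `days` data and both helper functions are illustrative assumptions). One busy day and one quiet day get equal weight in the first calculation, while the second weights each day by its volume.

```python
# Hypothetical data: (conversions, visits) per day.
days = [(50, 100), (1, 10)]

def average_of_rates(rows):
    """Mean of the per-row rates: every row counts equally, whatever its volume."""
    return sum(c / v for c, v in rows) / len(rows)

def overall_rate(rows):
    """Total conversions divided by total visits: rows are weighted by volume."""
    return sum(c for c, _ in rows) / sum(v for _, v in rows)

print(average_of_rates(days))  # treats the 10-visit day like the 100-visit day
print(overall_rate(days))      # 51 conversions out of 110 visits
```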

Reference

Book: how to lie with statistics
http://mathwithbaddrawings.com/2013/12/02/headlines-from-a-mathematically-literate-world/

Odds Notation

The “1 in every 45” is an intuitive way to display a ratio. (Odds notation)

Odds can sometimes be more intuitive to understand than raw percentages. Even when using %, it might be good to also show the odds notation value next to it.

Notation “A:B” describes how much A we get for every B. example: “1:4” means 1 in 4.

Odds are more intuitive than percentages:

1 in 2 = 50% (= 1/2 * 100)
1 in 3 = 33%
1 in 4 = 25%
1 in 5 = 20%
1 in 6 = 16.66%
1 in 7 = 14%
1 in 8 = 12.5%
1 in 9 = 11.11%
1 in 10 = 10% etc…

Convert percentages to Odds notation

Start with 40%
get the decimal version (x/100): 0.4
1/x (how many times does that fit in 1): 1/0.4 = 2.5
Odds: 1:2.5
(optional) scale to whole numbers: 2:5 (multiply both sides by the same value, in this case 2)

Convert Odds to percentages:

2:5
2/5 = 0.4
0.4*100 = 40%
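
The two conversions above can be sketched in Python as follows. The function names are made up for this example, and `fractions.Fraction` from the standard library does the "scale to whole numbers" step; it follows the "A in B" reading used in these notes.

```python
from fractions import Fraction

def percent_to_one_in_n(percent):
    """Return N such that the percentage reads as '1 in N'."""
    return 100 / percent

def percent_to_ratio(percent):
    """Return whole numbers (a, b) such that percent equals a / b * 100."""
    frac = Fraction(str(percent)) / 100
    return frac.numerator, frac.denominator

def ratio_to_percent(a, b):
    """Convert an 'a in b' ratio back to a percentage."""
    return 100 * a / b

print(percent_to_one_in_n(40))   # the 1:2.5 step
print(percent_to_ratio(40))      # the whole-number version
print(ratio_to_percent(2, 5))    # and back to a percentage
```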

Reference

“Odds require less computation (than percentages)”: http://betterexplained.com/articles/understanding-bayes-theorem-with-ratios/
Reference: http://lifehacker.com/practical-math-shortcuts-for-everyday-life-1495337792

Put it into perspective, add context

Let’s say some process had a significant speed improvement, measured in milliseconds, for example the loading speed of a web page. Sometimes just giving out the number might not be ideal, because it is hard for an audience to fully grasp its significance: milliseconds are small and many people don’t often work with them. (And to be fair, this technique is equally useful for analysts, to let the size of the change sink in.)

It is often useful to put it into perspective.

For example: “The speed improvement was of the same order as having Usain Bolt run at the speed of a cheetah instead”

Caveat: Be sure to make valid enough comparisons. Some things are just not the same.

On Rates

Prefer conclusions from relative percentages instead of absolute numbers. E.g. in a funnel traffic analysis, when the absolute number of drop-offs at a step is going up, it is tempting to infer immediately that the drop-off is getting worse, but that is not necessarily true. It could be that the overall input traffic is also increasing and the drop-off rate is actually constant. Calculate the relative percentage = drop-off / total.
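
A tiny sketch of that calculation, with invented weekly numbers for one funnel step (`weeks` and `drop_off_rates` are hypothetical names): the absolute drop-offs double, yet the rate does not move.

```python
# Hypothetical weekly numbers for one funnel step:
# (visitors entering the step, visitors dropping off).
weeks = [(1000, 200), (1500, 300), (2000, 400)]

def drop_off_rates(rows):
    """Relative drop-off per period: drop-off / total."""
    return [dropped / entered for entered, dropped in rows]

for (entered, dropped), rate in zip(weeks, drop_off_rates(weeks)):
    print(f"entered={entered} dropped={dropped} rate={rate:.0%}")
```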

Rates are excellent to use in general, as a relative measure (instead of absolute) to measure progression over time, use for comparison to other rates, etc…

But they can sometimes fail to give the whole view. Example: let’s say we are looking at a web site’s login success rate, and imagine that it drops massively overnight although nothing has changed in the login itself. What happened?

Typically we might have a situation where a new prominent link somewhere has started to send loads more traffic, which might not all be that interested in logging in, and might even be there by misunderstanding. Thus the input traffic will grow massively but the number of logins might not grow by the same volume, as users realize they are in the wrong place and leave.

So in this case the rate alone will hint that login is performing much worse, while the actual net effect is positive.

A way around this is to include the input volumes with the rate, especially if they are trended.

So we can say the login rate dropped, but that this is because of a massive influx of unqualified traffic hitting the login page.

Charts

Sugarcoating and eye candy can make the good great, but it won’t help a poor insight.

It is said that bar and line charts can cover almost anything that needs to be visualized. They have less impact than a visually stunning infographic, but about the same representation value, and are quicker to get done.

Viz aesthetics: make the important elements of the chart stand out, and lighten the color of the less important ones.

Histogram

Shows the distribution of the data, what are the most (and least) frequent values

https://www.youtube.com/watch?v=asEuFvWGJDs

Prepared Excel: histogram.xlsm

Boxplot

Good for

Check if distribution is symmetrical around the average.
Inspect data for Outliers.
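
What a boxplot draws can also be computed directly. The sketch below uses `statistics.quantiles` (Python 3.8+) for the quartiles and flags outliers with the common 1.5 x IQR rule; that rule, the sample data and the function names are assumptions of this example, not something the prepared Excel files are known to use.

```python
import statistics

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

def outliers(values, k=1.5):
    """Values further than k * IQR outside the quartiles (box 'whisker' rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical data with one obviously extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(five_number_summary(data))
print(outliers(data))
```

If the median sits far from the middle of the box, the distribution is not symmetrical.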

https://www.youtube.com/watch?v=DNpvSg2X0xQ https://www.youtube.com/watch?v=ZFbPnwKwVWk

Prepared Excel: boxplot.xlsm

Boxplot also is very useful when comparing 2 data sets directly:

Here is an excel with 2 box plot side by side: boxplot2.xlsm

Treemap

Good to compare the sizes of a collection of items in 1 picture; makes their relative size easy to understand, good to put things into perspective.

Can build nice looking ones at http://infogr.am/

Streamgraph

Good to see volume change over time. Combines Area and stacked Area charts together for the best of both worlds.

http://www.nytimes.com/interactive/2008/02/23/movies/20080223_REVENUE_GRAPHIC.html?_r=0

To build them:

http://research.microsoft.com/en-us/projects/msrdatavis/streamgraph.aspx
d3.js: http://bl.ocks.org/WillTurman/4631136

Maps

Maps are great tools for stats covering the whole world.

Reference

http://www.kaushik.net/avinash/data-visualization-inspiration-analysis-insights-action-faster/

Music Theory Chords

2013-05-01T00:00:00+00:00

Notes

C	C#	D	D#	E	F	F#	G	G#	A	A#	B	C
r(1)	.	2(9)	.	3	4(11)	.	5	.	6(13)	.	7	r(1)

Guitar Fretboard

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
3	4	.	5	.	6	.	7	1	.	2	.	3	4	.	5
7	1	.	2	.	3	4	.	5	.	6	.	7	1	.	2
5	.	6	.	7	1	.	2	.	3	4	.	5	.	6	.
2	.	3	4	.	5	.	6	.	7	1	.	2	.	3	4
6	.	7	1	.	2	.	3	4	.	5	.	6	.	7	1
3	4	.	5	.	6	.	7	1	.	2	.	3	4	.	5

2 = 9
4 = 11
6 = 13

Chords Triads

	I	III	V
maj	r	3	5
m	r	b3	5
dim	r	b3	b5
aug	r	3	#5

7’s

name
(dominant) 7	maj + b7
maj7	maj + 7
m7	m + b7
dim7	dim + bb7
half-dim 7	dim + b7
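
The triad and 7th tables can be turned into a small lookup, sketched below. Everything here is an illustrative assumption: the names `NOTES`, `DEGREES`, `TRIADS`, `SEVENTHS` and `chord`, the `m7b5` label for the half-diminished chord, and the simplification of spelling every note with sharps (so a b7 above C prints as A# rather than Bb).

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Semitones above the root for each degree used in the tables.
DEGREES = {"r": 0, "b3": 3, "3": 4, "b5": 6, "5": 7, "#5": 8,
           "bb7": 9, "b7": 10, "7": 11}

TRIADS = {
    "maj": ["r", "3", "5"],
    "m": ["r", "b3", "5"],
    "dim": ["r", "b3", "b5"],
    "aug": ["r", "3", "#5"],
}

# Each 7th chord is a triad plus one kind of seventh.
SEVENTHS = {
    "7": ("maj", "b7"),
    "maj7": ("maj", "7"),
    "m7": ("m", "b7"),
    "dim7": ("dim", "bb7"),
    "m7b5": ("dim", "b7"),  # half-diminished
}

def chord(root, name):
    """Return the note names of a chord, spelled with sharps only."""
    if name in TRIADS:
        degrees = TRIADS[name]
    else:
        triad, seventh = SEVENTHS[name]
        degrees = TRIADS[triad] + [seventh]
    start = NOTES.index(root)
    return [NOTES[(start + DEGREES[d]) % 12] for d in degrees]

print(chord("C", "maj"))
print(chord("A", "m"))
print(chord("C", "7"))
```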

Extensions

Add the 9th, 11th and 13th. Theoretically they are piled up, as in the examples below, but in practice some notes are omitted when adding these extensions. Otherwise the chords become dissonant or even impossible to play (depending on the instrument).

Examples:

name
maj9	maj7 + 9
m11	m9 + 11
13	11 + 13

Inversions

It's about changing the order of the notes of a chord. It's useful for: arranging bass/treble lines on a chord progression; getting different flavors of the same chord; getting reasonable fingerings.

Naming

name
maj	major
m	minor
dim	diminished
aug	augmented

My Music Online

Programming

2013-04-10T00:00:00+00:00

Precepts of Programming

State is the root of all evil, avoid global variables, and shared state.
Go for pure functions with well defined interfaces (for the same input always same output). Create building blocks using pure functions, that are composable and reusable, also makes for better testing.
Analyse and Design up front, ideally away from computer, figure out what is needed and how to tackle it, trade-offs between tools and approaches, don’t start coding right away - Hammock driven development
Constructs vs Artifacts - it's not about how pretty and short the programming language is (the constructs), but more about how well it solves the problem, maintenance, scalability (the artifacts). Different tasks will focus on / prioritize different artifacts. - See Simple-Made-Easy talk.
Write tests, because they can often be a life saver! Well crafted tests. But beware that tests are no ultimate guarantee that all code is bug free.
Bugs are per line of code. Code as little as possible, but make clean and clear code.
Practice to recognize complexity and how to avoid it and simplify it. Break a complex thing into many simple / independent composable parts. It is easier to work with simple - See Simple-Made-Easy talk.

Reference:

http://www.infoq.com/presentations/Simple-Made-Easy - define simplicity, recognize complexity, tackling problems only scales when handling with simplicity. Tools to simplify and avoid complexity in coding.
http://www.youtube.com/watch?v=f84n5oFoZBc - Hammock Driven Development - Rich Hickey
https://github.com/thomasdavis/best-practices
http://www.infoq.com/presentations/Design-Composition-Performance - Rich Hickey

Apps

Make things runnable on command line as standalone bin, easier to integrate with other apps later on.
Even better, make it accessible from a URL endpoint (a service), as a standalone API. JSON seems to be the norm these days, but data formats have changed over time and will probably continue changing. CSV is a good fit for data analysis: good tool support and easy to work with.
GUIs are not composable. But they can be extendable with plug-ins. And they are easier to use (compared to command line), and the current standard for applications

Functional Programming

Functions as values, which can be passed as arguments to other functions. Functions as 1st class citizens.
Avoid shared state. Avoid variables outside of functions.
Design functions that take other functions as arguments; it makes them much more generic and useful in the long term.
Iterate with map, reduce and fold; avoid iteration where state changes.
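
A minimal illustration of these points in Python, using only the built-in `map` and `functools.reduce`. The helper names (`apply_to_all`, `total`, `compose`, `double`, `increment`) are made up for the example.

```python
from functools import reduce

def apply_to_all(func, values):
    """Takes a function as an argument: generic over whatever func does."""
    return list(map(func, values))

def total(values):
    """A fold: combines values pairwise without mutating any outside state."""
    return reduce(lambda acc, x: acc + x, values, 0)

def compose(f, g):
    """Returns a new function: functions are values like any other."""
    return lambda x: f(g(x))

def double(x):
    return x * 2

def increment(x):
    return x + 1

double_then_increment = compose(increment, double)

print(apply_to_all(double, [1, 2, 3]))
print(total([1, 2, 3, 4]))
print(double_then_increment(5))
```

All functions here are pure: the same input always gives the same output, which is also what makes them easy to test.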

Underscore.js and its migrations to different languages seem to be a standard for having the functional programming building blocks as a library to any programming language.

References

Writing documentation: http://jacobian.org/writing/great-documentation/what-to-write/
How Software Companies Die (half joking): http://www.cs.cmu.edu/~chuck/jokepg/joke_19970213_01.txt
State of affairs 2014: https://www.tbray.org/ongoing/When/201x/2014/01/01/Software-in-2014

Programming languages

Biased and Opinionated list.

Programming languages are tools, choose best tool for the job at hand.

Coders who love coding just for the enjoyment of it are very passionate and picky about language (the majority?); those whose goal is just the final product will be more willing to choose the best language (tool) for the job.

By Function:

Creativity, for something new, to put little limitations on creativity of the language, the most crazy things: Clojure
Prototype, play around, do EDA, try out API’s, parse text, quick plot, quick everything: Python
Machine learning, most advanced, most accurate methods catalog, run a one only method: R
User Interface: Javascript + HTML

C/C++

Function: Fastest language, for serious systems, the traditional, safe and hard way.
Examples: Operating Systems.
Opinion: Old languages, although C++ tries hard to keep up. C is small and nice, but very low level and close to the machine architecture (which is limiting, because you have to fit the computer architecture into the solution of your problem; they should be abstracted as separate problems). C++ is big and complex, a natural growth of C, and also one of the most used languages, because it is fast and was maybe the 1st modern language. It is so complex that those who use it have a hard time taking criticism of it, because of the investment already made.

R

Function: The most complete and free collection of statistical methods and libraries.
Opinion: Excellent for exploratory data analysis, but a very clunky programming language, probably the result of many years of bolting on features (I'm guessing even by different people) instead of a cleanly designed plan.

Javascript

Function: User Interface, because of HTML, because of internet. The best platform for sharing and distributing creative visualizations / presentations.
Examples: Web, D3.
Opinion: Love/hate with this one. Love the powerful constructs that trump the abilities of big languages like Java. But it is inconsistent, with many holes that require some overhead to avoid shooting yourself in the foot. I think it had the luck to be in the right place at the right time, with just the right amount of interesting features to make developers stick with it. It has been growing a lot recently, and has one of the most widely distributed language interpreters in history? (the browser) It is cross-platform, a better solution than developing platform-dependent things.

Ruby

Function: small scripts to solve a particular local problem, often a better replacement to shell scripts, great prototyping, and rails for better web pages.
Opinion: Simple but powerful, does not get in the way. Perfect scripting and prototyping language, for small scale. Great for text processing and for gluing different systems / apps together. Downside: a smaller library ecosystem than Python (which targets about the same audience). Meta-programming facilities allow for creativity.

Python

Function: great for prototyping, keeps covering more and more fields over time, does everything now ? Taking over R soon for EDA ?
Opinion: Ruby’s older brother: more users, bigger ecosystem, more libraries, does more things, but not as pretty as Ruby (I might be biased). An excellent scripting language, good for fast development; people find ways to fit it into every style and to cover more and more ground with the things it can do.

Java

Function: Build big scalable systems by big enterprise teams.
Opinion: Designed to be safe and to scale a lot, and designed to be dumbed down (so that it is safe). It has grown a lot, maybe a top 3 language by usage? Known to drive people slightly crazy: it has a huge overhead of configuration, setup, library dependencies, files, etc… Very portable. Very scalable.

Clojure

Function: The gate to the JVM libraries. (Scalable systems, processing large volumes of data, production / online systems, databases, queues)
Example: Cascalog Hadoop, Mahout.
Opinion: Allows for creativity. The basic syntax is super simple, so it allows and encourages creativity in abstraction and composition, like a clean slate. Encourages functional programming, with no state (or very little in reality).

Julia

Function: Appears to be best designed language for data crunching. (the new Fortran). Python was adapted, Java was adapted, Julia was designed for data crunching.
Opinion: not mature yet (in 2013). Hard to compete with python, if all libs needed already exist in python.

Reference:

http://blog.redowlanalytics.com/post/67465127385/tools-for-the-data-science-craftsman

Haskell

Function: General purpose. Expanding programmers' skills and know-how about programming. Expanding the boundaries of programming research.
Opinion: From the math world, thus clean and surgical; born on academic grounds, viewed as a very noble language and a holy grail. The functional language.

reference: http://www.haskell.org/haskellwiki/Haskell_in_industry

Data Analysis Introduction

2013-03-25T00:00:00+00:00

Why, whats the point?

1. End Goal: Improve something

In business: increase revenue, customer loyalty, reduce costs, increase user count, avoid fraud. In government: increase business creation, reduce tax evasion, identify terrorism beforehand. Health: avoid diseases, prolong lifespan, increase general well being.

skills: domain knowledge(digital marketing, business internals), ask big scope questions to distill what is important: what is the purpose of this? What happens if we don’t have it?

2. By doing something new / different

e.g. Do Y instead of X see if improves against measures (KPIs)

skills: 1st need to know the current baseline: Dashboard (for KPIs monitoring), 2nd use AB testing to optimize.

3. Because of an insight / revelation

e.g. We found that Y is more likely better than X.

skills: Be curious, play around, make observations, question a lot, distill what is important, common sense, this can contribute to metrics(KPIs to monitor) but should be more focused on the ideas for change.

4. (An Insight) That was presented in a clear way

skills: visualization, communication, rhetoric, human behavior

5. (An Insight) That was analyzed in a correct way

skills: statistics, probabilities, counting, estimating, correlation, regression, modeling, etc…

6. (An Insight) That had adequate data available

skills: SQL, programming, google analytics, databases, etc…

Particularly in big organizations, which have many moving parts (variables), it can be very hard / impossible to predict the impact of a change. Very often the organization's current status is a result of random lucky / unlucky events (often external) at the right / wrong time. Typically the process of change (improvement) is a sequence of trial and error until a change works out. So, instead of choosing changes “blindly”, data analysis can help with narrowing down the search space of potential choices and with focusing efforts in the right direction.

How ?

“What is the truth?” is an excellent rule of thumb to guide the data analysis practice.

Data Analysis top level typical tasks

Revealing the facts, knowledge not known before.
Optimization, once you know the facts you try to optimize them, AB testing. (the hypotheses are domain specific).
Monitoring, keep observing the updated performance and optimization results.
Estimating / Predicting Future, by observing the past.
Defining whats important to keep an eye on, the KPIs. (this is domain specific).
Telling a Story, communicating the findings and recommendations.

Typical flow of an Analysis piece

Collecting data: requires a mix of technical and analytical skills: knowing comprehensively the technical details of how the tool works, and designing an instrumentation solution that allows useful analysis.
Finding Insights: Look into the data, explore, identify trends, find the most interesting actionable bits.
Telling a good story: Automated recurrent (daily/weekly) documents showing the KPI’s. Requires analysis up front to figure out what are the KPI’s to put on dashboard, and to pick the most actionable metrics.

The output of this work is a report. That either:

Is used to monitor performance and as a baseline for changes. - typically a recurring report.
Reveals new information that can be used to define strategies and future changes.

EDA - Exploratory Data Analysis

a.k.a Looking for insights

Rules of Thumb

Data Validation: Having 2 (or more) sources of data allows checking whether any 1 of them looks right.

Data Validation: Mixing data from 2 different tracking systems is typically a no no. They will very likely be slightly different in absolute volume. A better way to compare is to check instead if trends are consistent.

Data Validation: To confirm an observation, re-run analysis from a different angle (in a different way) to see if they consistently match. (and try many angles, the more angles the higher the confidence). A form of data QA.

Getting to data: When having a problem to get the data, maybe there’s a different way to get to the same data? what is the closest data proxy to it ? although not perfect, but a useful / workable approximation. Don’t give up right away, there is often a way.

Getting Meaning: It is not the totals that matter the most, but the relative power they have. Prefer trends to absolute numbers. An absolute number is only meaningful when put in context (in a valid comparison).

Understanding a measured value (1 variable)

Histogram, to see the distribution of the data. - check if it is normally distributed
Regression, to model the data, typically over simplifies, but often is good enough.

Try out correlations

A way to find insights is to look for correlations. Plot 2 variables against each other in a scatter plot and see if any pattern appears. To see the pattern you often need to smooth the noise into a line that shows it. Smoothing (a line) is about getting the right balance between smoothness and accuracy. Loess is a typical method (confirm a good fit from the residuals). Other tools for revealing patterns in plots:

Logarithmic scale: shows power laws, good for data spanning many orders of magnitude
Banking: humans recognize changes in curve slopes more easily when the slopes are at about a 45 degree angle

Linear regression is best for finding a formula for prediction. It really only works when the process generating the data really is a linear function: we have a set of inputs and we measure a set of outputs. It only works if the data can be described by a straight line. It makes quite a few assumptions and produces only a summary / aggregation.
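
For one input variable, the least-squares line can be computed by hand. This is a bare-bones sketch with made-up data; `fit_line`, `predict` and `residuals` are names invented for the example, and a real analysis would more likely use a statistics library and inspect the residuals graphically.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Use the fitted formula for prediction."""
    return slope * x + intercept

def residuals(xs, ys, slope, intercept):
    """What the line fails to explain; a pattern here means the data is not linear."""
    return [y - predict(x, slope, intercept) for x, y in zip(xs, ys)]

# Hypothetical, perfectly linear data.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(predict(5, slope, intercept))
print(residuals(xs, ys, slope, intercept))
```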

Multivariate Linear regression

Apply multiple linear regression to a data set with many variables to find which variables influence the output the most. Then use standard confidence interval calculations to check whether that relation has enough confidence.

Time Series

A time series is a bivariate (2 variables) representation with a couple of added properties.

Components of time series: trend, seasonality, noise, other ( missing values, outliers, etc…)

Data Validation:
Is the data fair for the whole time period it represents? Or did, for example, some definitions change in the meantime, i.e. the rules of the game that generated the data changed and could be biasing it? In practice it is hard to make this work for many years in a row because of constant rule changes.

Data Validation: Data by Week: Comparing data by week, might be fairer than by day, because of weekly cycles, so removes daily bias.

A time series task is typically about:

description: past, trend, seasonality, changes in behavior
prediction: future, forecasting
control: present, monitoring over time

Smoothing is a way to eliminate noise. Running averages: a weighted average is better than a non-weighted one. Use Gaussian weight functions. Exponential smoothing is even better, and allows forecasting (moving averages do not). Use the Holt-Winters method.
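
A minimal sketch of the two simplest smoothers, a plain (non-weighted) moving average and simple exponential smoothing. The function names and the sample series are made up. This is not Holt-Winters, which additionally models trend and seasonality.

```python
def moving_average(values, window):
    """Non-weighted running average over a sliding window."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def exponential_smoothing(values, alpha):
    """Simple exponential smoothing: recent points weigh more (0 < alpha <= 1)."""
    smoothed = [values[0]]
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical noisy series.
series = [10, 20, 30, 20, 40, 30, 50]
print(moving_average(series, 3))
print(exponential_smoothing(series, 0.5))
```

With simple exponential smoothing, the last smoothed value can serve as a naive forecast for the next point.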

Beware of aggregate functions over non-adjusted time periods. For example, the sum of sales in February will be about 10% lower simply because the month is shorter. The same goes for business days: in a month that includes 5 weekends, the sum of sales will look lower.

The correlation function is good for revealing how much memory there is in the data, and for revealing periodicity. (Wonder if I could build an automatic periodicity indicator from a time series chart.)
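
As a rough sketch of the idea, the autocorrelation of a series with itself shifted by some lag can be computed as below. `autocorrelation` and the sample series are invented for illustration; a lag with high autocorrelation suggests a period of that length.

```python
def autocorrelation(values, lag):
    """Correlation of the series with itself shifted by `lag` points."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    covariance = sum((values[i] - mean) * (values[i + lag] - mean)
                     for i in range(n - lag))
    return covariance / variance

# Hypothetical series that repeats every 4 points.
series = [1, 2, 3, 4] * 5

for lag in range(1, 6):
    print(lag, round(autocorrelation(series, lag), 2))
```

Scanning the lags and picking the highest value is one crude way to get an automatic periodicity indicator.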

Interestingly enough, most machinery developed for DSP (digital signal processing) applies equally to time series operations. The signal is the time series; filters are operators like smoothing and differentiating. And convolution is combining 2 sequences to yield a third one. Applying a smoothing function to a chart is like convolution: the series is combined with a weight function to produce a 3rd, smoothed signal.

Back of the envelope estimations

When there is a hypothesis about the data, find a back of the envelope estimation to support it, or to check that it is about right. Ex: when observing that Feb sales are lower, and raising the idea (hypothesis) that it is probably because of a shorter month, calculate how big the drop should be with 3 fewer days. 3 days is about 10% of 30, so the sales drop should be around 10%. If it is, that fortifies the theory; more importantly, if it is not, the theory fails.
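
The February estimate as a two-line calculation, assuming (a simplification) that daily sales are constant; `expected_drop` is a made-up helper name.

```python
def expected_drop(days_short, days_long):
    """Expected relative drop in a monthly total if daily sales stay constant."""
    return (days_long - days_short) / days_long

# February (28 days) compared with a 31-day month.
print(f"{expected_drop(28, 31):.1%}")
```

If the observed drop is far from this figure, the shorter month alone does not explain it.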

Reference

http://www.gregreda.com/2014/03/23/principles-of-good-data-analysis/

Modeling

A representation of an existing behavior. Never perfect, but an attempt to be close. Can then be used to predict future behavior. Also to be used for optimization.

Statistical distributions

a distribution is a way to summarize a dataset.

Disciplines

Web Analytics

A tool for improving a web site's activities by making better “informed” decisions.

Aha, this is what’s happening with my site, right, I will then do something differently (or this actually confirms what I’m doing is already good).

What if I do this? What will happen to my sales ? To my site engagement?

Used on

Digital Marketing - ROI from marketing online campaigns.
Understanding User Engagement – How many users, is site growing, do users keep coming back (a proxy for user loyalty).
Front End Optimization – Finding the ideal layout, messaging, page flow, conversion buttons. (A/B testing)
“Informed” Change – I need/want to change the site, what’s going to be the impact ?
eCommerce - Optimize the online payment flow, find where users are getting stuck (keep an eye on mobile users), how to improve it further to increase conversions.

Digital Marketing

Tracking mail marketing flow: email -> page -> conversion funnel. Identify if maybe a step is specially bad at converting and go optimize that. e.g. Maybe that week a particular page in the conversion funnel was broken and the mail campaign didn’t convert as well as expected.

Understand the Audience

Who uses my product? And how? Use the audience data into next iterations of my product, to target it better:

Where do users drop off from my site? Is it a broken link? Bad messaging? Non working form ?
Do I need to support old browsers ?
What browsers should my QA process test on ?
What screen size should I develop my app for ?
Is the interest growing over time ? (traffic volume)
How are users reaching the site? brand search? are they looking for something specific? are the search traffic trends changing over time? should the site adapt to reflect the users searches?
etc…

eCommerce

Money is the #1 of any business, the web pages that support this process are central to it, and should get a good amount of attention.

A couple of ideas to explore:

Look at how the visitor days-to-purchase metric changes when prices change; it probably grows.
If days-to-purchase is very big, it means users could be better informed up front. Make the site better and measure again.

Optimization - Data Driven Change

Does introducing a new FAQ page for a product improve sales ?
Does providing an alternative FB login on the login page translate into more new users ?
Does having a trusted logo on the credit card capture page improve sales ?
What is a better messaging for the site to use ?
Bigger Buttons ? Color of buttons, placement of the buttons, etc…

Machine learning

A program that learns (creates a model) from the existing data, useful for:

Predicting (guessing / inferring) future from the past
Automating things - a program that learns is then able to do new things without being explicitly programmed.

Contributes a lot to the modeling section.

Example simple: from the email history we notice that spam emails almost always contain the keywords “buy this”, so it is possible to infer that the next email with “buy this” is likely spam too, and the computer can filter it automatically.
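The spam example can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: the email history, the helper names `spam_rate` and `likely_spam?`, and the 0.5 threshold are all made up for the sketch, and a real spam filter would use many more signals than one keyword.

```ruby
# Made-up email history; each entry records whether it was spam.
history = [
  { text: "buy this now", spam: true },
  { text: "please buy this watch", spam: true },
  { text: "meeting at 10", spam: false },
  { text: "did you buy this book?", spam: false },
  { text: "buy this today", spam: true }
]

# "Learning": estimate how often emails containing the keyword were spam.
def spam_rate(history, keyword)
  matching = history.select { |mail| mail[:text].include?(keyword) }
  return 0.0 if matching.empty?
  matching.count { |mail| mail[:spam] } / matching.size.to_f
end

# "Predicting": flag a new email when the learned rate is above a threshold.
def likely_spam?(text, keyword, rate, threshold = 0.5)
  text.include?(keyword) && rate > threshold
end

rate = spam_rate(history, "buy this")
puts rate
puts likely_spam?("buy this car", "buy this", rate)
puts likely_spam?("lunch tomorrow?", "buy this", rate)
```

The "model" here is just one number (the observed spam rate for the keyword), which is enough to show the learn-then-predict split.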

Example complex: by recording all the details in how a person drives a car given the road ahead, a model can “learn” to drive a car in the same way.

An average is a model, simple but naive and fails with outliers.

A regression line is like an average for 2 dimensions, also a model.

It is also said that all models are wrong, but we try to get as close as possible (e.g. by minimizing the sum of squared errors).
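To make the "average is a model" and "regression line is a model" ideas concrete, here is a minimal Ruby sketch that fits both to the same points. The data is made up, and `mean` and `fit_line` are hypothetical helper names; the line is fitted with ordinary least squares, which is one common way of minimizing the squared errors.

```ruby
# Two simple models fitted to the same made-up data points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]

# Model 1: the average. It predicts the same value for every x.
def mean(values)
  values.sum / values.size
end

# Model 2: a regression line y = slope * x + intercept, chosen to
# minimize the sum of squared errors (ordinary least squares).
def fit_line(xs, ys)
  mx = mean(xs)
  my = mean(ys)
  covariance = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  variance = xs.sum { |x| (x - mx)**2 }
  slope = covariance / variance
  [slope, my - slope * mx]
end

slope, intercept = fit_line(xs, ys)
puts "average model predicts: #{mean(ys)}"
puts "line model: y = #{slope} * x + #{intercept}"
```

On this data the average always answers 6.0, while the line follows the trend, which is why the line is the better model whenever y depends on x.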

Andrew Ng’s machine learning classes are excellent.

The latest big trend, as of the start of 2013, is probabilistic programming, also called model-based machine learning: http://probabilistic-programming.org/wiki/Home

Nice video explaining: http://radar.oreilly.com/2013/04/probabilistic-programming.html

Markov chains explained: http://techeffigy.wordpress.com/2014/06/30/markov-chains-explained/

Other ML techniques: https://scottlocklin.wordpress.com/2014/07/22/neglected-machine-learning-ideas/

Work, Productivity and Tools

2013-03-15T00:00:00+00:00

Precepts of Work

What impact does it create ? (What world problem does it solve ?)
Long term, working on something relevant in 1, 5, 10 years?
Learn from and follow the most successful people in the field.
Are you contributing something back to the ecosystem that is helping you?
Build (& don’t burn) bridges. Cultivate friendships.
Allocate time to learn.
Allocate time for the slow but long-term important work. Don’t be trapped by the urgent short-term rat-race tasks.
Don’t work too much: the brain needs down time, and burnout creeps in slowly.
Say what you think = be assertive.
Step outside the comfort zone: always have a little challenge, never be too comfortable. Too comfortable -> laziness -> no learning. The comfort zone needs to be forced to expand.
Be humble, but keep a balance between humble and over confident.
Sometimes it is good to let ideas sit for a while and see if they still make sense in a few months. Especially the ones that need a big investment up-front.
Set up an environment for productivity: like my guitar rig, that is always kinda ready to go, thus makes it very easy to pick it up and play nearly every day.
Simplify, reduce - are great rules of thumb to follow.
Curating perfects a product: often the 1st version has holes, and maturity / curating is what makes it good and reliable. Important things need curation. But not everything new is bad: it might be a game changer whose immaturity is a minor nuisance compared to the benefits.
What to output? An application, a physical product, a book, a code library? - A Paper! A paper is the explanation of a new idea that advances the field; a well explained paper / idea does not age the way a software application does.

Reference:

http://www.youtube.com/watch?v=5aH2Ppjpcho - Dan Ariely: What makes us feel good about our work?
http://www.paulgraham.com/todo.html
http://www.rosshudgens.com/thoughts-from-paul-graham/ - not only about work
http://brendansterne.com/2013/07/11/do-the-right-thing-wait-to-get-fired/
http://www.buzzfeed.com/mikespohr/37-things-youll-regret-when-youre-old
writing a paper: https://research.microsoft.com/en-us/um/people/simonpj/papers/giving-a-talk/writing-a-paper-slides.pdf

Precepts of Productivity

Brain dump: write down a todo list, define priorities, pick the item at the top and forget about the rest; don’t keep worrying about things. (see Kanban, GTD)
- Set notifications as reminders for important things (google calendar or outlook.com)
Kill notifications that can interrupt: email / IM are a major source of distraction; schedule times for them, don’t be driven by them (it is better to drive them)
Block quiet time (for a Maker’s agenda see reference)
Avoid multitasking. It adds the overhead of context switching, which drains attention / energy from the task at hand; same problem with meetings interrupting work.
Calculate how much time it takes: often we procrastinate on something because we don’t feel like doing it, but figuring out that I’ll be done with it in 20 mins often helps.
Brain gets tired as day goes by (some say only 4-5 hours of full on concentration)
Health in general, and lack of sleep play a big role in productivity.
Keep a done list - when time is spent on something (> 30mins), write it down in a done list, per day, a quick 1 line description. This helps to avoid getting to the end of the week without remembering where the time went, and is great for postmortem analysis. - Seinfeld method: looking at what has been done motivates you to keep it up.
Track how you work, pick up weak points & time wasters, AB test / experiment for a week with alternative ways of working (less mail or no IM, or no internet, etc…) and compare productivity. (tools: manictime, rescuetime, etc…)

Reference:

http://abetterlife.quora.com/How-to-master-your-time-1
http://www.businessinsider.com/time-management-and-productivity-hacks-2013-4
https://www.quora.com/Tips-and-Hacks-for-Everyday-Life/What-are-some-uncommon-ways-to-work-smarter-instead-of-harder
http://www.paulgraham.com/makersschedule.html - Maker’s schedule vs Manager schedule - Paul Graham
http://www.bakadesuyo.com/2013/08/real-super-powers/
http://mariusandra.com/blog/2014/01/how-to-be-productive/
http://firstround.com/article/70-of-Time-Could-Be-Used-Better-How-the-Best-CEOs-Get-the-Most-Out-of-Every-Day

Precepts of Creativity

Writing is a great way to think about something.
Reading a book helps creativity. (Maybe because I do so little of it and it takes me away from the computer distractions).
Exercise (even just 20 mins, even walking), greatly boosts productivity / creativity.

Precepts of Tackling a Problem

Step away from the computer for important problems, computer is a major source of distraction, the analysis & design phase is key to decide how to tackle the problem, focus and write it down - this is the loading up phase of Hammock Driven Development
Don’t do important tasks all in 1 go, accept that the first iteration is not perfect. Ideally come back to it overnight - the background mind is better at synthesis, consolidating new information and making new associations.
Hurrying often overlooks the fine details and introduces errors - sloppy work. The ideal is quick and not dirty, but often this is not possible.
When trying to get your head around a complex thing, it is better to document it, write it down in a clean way, for easier understanding. A diagram provides a high level view, for example. (probably related to the limitation of keeping 7 vars in the head). A Mind Map is a great tool for this.
Practice to recognize complexity and how to avoid it and simplify it. Break a complex thing into many simple / independent composable parts. It is easier to work with simple - See Simple-Made-Easy talk.

Reference:

http://www.youtube.com/watch?v=f84n5oFoZBc - Hammock Driven Development - Rich Hickey

Precepts of Tools

Best tool for the job, a means to an end
Make each program do 1 thing well
Make each program a (standalone) filter
Be very familiar with just a few
Tools will change / deprecate: avoid investing too much into tools (because, as a side effect, you will resist change in the future)
Prefer output to console instead of writing to file, easier to filter output, when used from other programs.
Get it running ASAP = agile
Do tests for important things, they can save you from trouble
Math is a great tool

Reference:

The Unix Philosophy: http://en.wikipedia.org/wiki/Unix_philosophy#Mike_Gancarz:_The_UNIX_Philosophy

Getting Things Done Process

Write down all the stuff to achieve an empty mind from worries
Review the written list, for each item
- what is the next step to do?
- if not actionable, delete it (or save it as reference).
- less than 2 mins: then do it!
- longer than 2 mins: defer for later into TODO list (defer tool: calendar reminders)
- Organize the TODO list by priority
- Do it

Tackling a project (a bigger task)

Purpose - the why
Outcome visioning - what it should look like / how it should behave.
Brainstorm solutions
Organize what (components) needs to be tackled in what priority.
Identify next actions

http://www.wikisummaries.org/Getting_Things_Done:_The_Art_of_Stress-Free_Productivity

Competition

I always thought that competition for money can also be a good thing (although it is evil most of the time), because it is an incentive for sellers to create disruptive technology to get the competitive edge. But I just realized it is not about money in particular, it is more about a specific “scarce resource”: any scarce resource will work just as well as an incentive (as a goal) for competitive breakthroughs. Money is just the current civilization’s most valued resource. I also just realized money in itself is not evil; evil are the (evil culture background) people who take unfair shortcuts to get more of the scarce resource (money), and potentially make other people’s lives worse.

Summary

The problem seems to always come back to “evil (culture background) people”, although competition and money seem to get blamed for it, because money is where the most intense competition is going on and naturally that also attracts bad people.
Having a scarce resource, or in more general terms “a goal”, can be a great incentive for breakthroughs that contribute to making everybody’s life better: Food, Cars, computers, Airplanes, etc…

Presentations

The key to a safe presentation is preparation: prepare beforehand the script that runs through the whole presentation, and simulate the whole presentation in your mind the previous day. (Simulation is a great tool, not only for presentations)

With experience, the ability to improvise will increase, thus requiring less up-front preparation.

Math

It is a tool. Using Mathematics is about abstracting a problem into the math language (equations & functions).

And then using the manipulation features of the math language to solve the problem within the math context, already abstracted away from the domain problem.

Using Math: the first step is actually learning how to represent a problem in the math language. (for that you need background know-how of the language, but that background is actually not the purpose - math education nowadays).

Health, Food & Exercise

2013-03-11T00:00:00+00:00

Rule of Thumb

Our bodies are the result of evolution over thousands of years on a specific diet and a specific type of physical exercise, and that is what the body was “trained” to handle best, so the closer we live to that, the more likely we are to be healthy.

Experiment, although every body has common needs, every body has unique needs also, some will be more deficient in greens, some more deficient in proteins, some saturated with cholesterol, etc… so test and experiment for your unique case and your background to find what works best. Same goes for sport.

On the Plate

1/2 vegetables and fruit
1/4 protein: fish, (white)meat, nuts, beans
1/4 whole grains

Reference:

http://www.hsph.harvard.edu/nutritionsource/pyramid-full-story/files/2012/10/healthy-eating-plate.pdf

General Food Guidelines

Almost anything that is industrially processed / refined is bad, refined flour, artificial flavorings, etc… General Rule, the less industrially processed the better.
Deep red natural foods are healthy and prevent cancer, thanks to their many antioxidants - berries, raisins, pomegranate.
AB test routines, foods, cooking methods, etc… try to find what works best for you. Every body is a tiny bit different.
Depression and anxiety disorders seem to be correlated with a bad gut = a damaged / unhealthy intestinal tract for prolonged periods of time. Good bacteria help the gut recover; go for natural fermented foods: sauerkraut, kefir, yogurt.
Sleep is underrated, make an effort to sleep well.

Cooking

High temperature cooking often alters chemical properties and, if overdone, can become harmful. Keep the cooking temperature low. (see sous-vide cooking)
Soups (made at home) can be very nutritious. Soups from outside (of home) are not necessarily healthy, because anything can have been blended into them.

Losing Weight

To lose weight, eat fewer calories than your body consumes:

calculate how many calories your body consumes
count / estimate the calories of your food
make sure you eat less than you consume
increase body calorie “spending” by exercising

Food and Nutrients:

Calories: what makes you gain weight
Protein: makes you feel full
Reduce portion size, smaller plate
Soups are great for filling up
Calcium blocks fat from being absorbed: which is best, yogurt or kefir ?

Note that losing weight is normally healthy, but not mandatory. When losing weight too fast, beware of nutrient deficiency.

Observation: losing weight makes one smarter. I’ve heard this explained by evolution: long ago we sometimes found ourselves without food, and in those times the brain naturally gets “smarter” in order to solve the food problem.

Reference:

http://www.reddit.com/r/loseit/wiki/faq
http://www.youtube.com/watch?v=xS8nt0qoRuc - BBC how to be slim
http://matt.might.net/articles/least-resistance-weight-loss/

Low Carb Diet

Great for losing weight: the lack of carbs forces the body into a state of ketosis, which burns fat instead of carbs, causing weight loss.

Avoid carbohydrates: rice, potato, bread, pasta, flour
Avoid sugars. (Fruit slows down the weight loss, but is healthy)
Use protein to feel full: meat, egg white, fish
Legumes (beans) contain protein and good nutrition

As with anything else, too much of 1 thing might be bad if prolonged for too long, so probably better for only a limited period of time.

Beware: very monotonous diets can lead to specific nutrient deficiencies.

Digestion

Order of food matters. Mixing slow digestion food with fast digestion food will force some food to stay longer in the intestinal tract than it should. Food starts to go bad, and gets absorbed into the blood stream.
Fast to Slow digestion: water, fruit, vegetables, legumes, chicken, whole milk, pork.
Good trick to feel full: soups. Water by itself goes fast through the intestinal tract, but if mixed into a soup with other food, it will remain in the stomach longer, and signal fullness.

Sport & Exercise

Intensity: short but very intensive exercises seem to yield the best results.
Fun: Try different types of exercises, go for the one that is more fun to do (the more fun the more likely to stick).
Experiment: everybody will be deficient in some parts more than others, due to routine, and habits, so test&experiment for your body to find what works best for you.

Reference: The Truth About Exercise http://www.youtube.com/watch?v=tyQSzx0ofto

Body Recovery

How 2 Stay Young & Beautiful - http://www.youtube.com/watch?v=AMC-D7uu3L8

Vitamins

A: avoid in excess, enough from food
E: ok to take
C: not harmful, but probably getting all from foods

Truth about vitamins: https://www.youtube.com/watch?v=zp7WdxvoBfI

Cancer

Phases

Initiation: happens from bad chemicals that get to our system.
Germination, needs animal protein, and can be dulled by removing animal protein.
Adult stage, grown cancer

Blood cholesterol is a key indicator, and strongly related to heart disease and to cancer development.

Animal products are strongly correlated with dietary fat intake, animal products are very fatty by default

Reference: “the china study” book.

Avoiding Burnout

http://andrewdumont.me/avoiding-burnout

Meta Learning

2013-03-07T00:00:00+00:00

How to Learn

Deconstruction:

Reduce content to its minimal moving parts (deconstruct, simplify, zoom out, distill).
Start with the outcome; start at the end and work your way backwards
Viewing the subject from a variety of perspectives
Walk from the end goal backwards, recursively asking “Why?”
Looking at what successful outliers are doing
Probing the minds of experts through interviews
Finding simple commonalities in a domain that can serve as a key to accelerate learning

Selection: analyse and find the common features (80/20)

Choose the 20% that covers the 80%
The Minimal Effective Dose (MED): “The lowest volume, the lowest frequency, the fewest changes that get us our desired result.”
The what(material) is more important than the how(method)

Sequencing:

Choose the right order to study / learn the material
Make it stick first, start with fun and simple

Stakes:

Have something to lose if you miss the goal

Compression: one-pager’s cheat sheet

The Prescriptive One-Pager lists rules or principles that help you generate real-world examples.
The Practice One-Pager lists real-world examples to practice, which helps you learn the principles indirectly.

Frequency: breaks / intensity(immersion) / expected progression

Plan a study/practice schedule that provides the frequency needed to gain competency.

Encoding: zip / memorization tricks

Find ways to associate the knowledge and skills with what you already know.

Related: spaced repetition. A practical way to do this is to create spaced-repetition decks, put them into spaced repetition software, and just practice with them a few minutes every day.

reference:

Memorizing a programming language using spaced repetition software: http://sivers.org/srs
Spaced repetition for win8 phone: http://www.kleio.info/

Do a Plan

Take into account the above rules of thumb to define a plan on how to tackle:

What to study
- Look at the end goals and deconstruct - to define what to study
- Select: apply 80/20 rule for selection and MED
- Find the ideal material, interview pros
When to study it & Expected Outcomes
- Look into the sequencing principles - organize an ideal “growth” sequence.
- Define clear dates & outcomes for the sequence - applies the stakes principle.
During the study (and afterwards when practicing)
- Summarize learnings in cheat-sheets, can be part of the expected outcomes - keep refining them
- Potentially use memorization / encoding tricks if/when required

Notes

Explaining to others forces you to find clear explanations and internalize the learnings
Writing often helps thinking and developing the subject further
When learning be proactive, especially by asking many questions

References

The 4-Hour Chef Book
http://www.fourhourworkweek.com/blog/2012/12/11/how-to-play-the-guitar/
http://theelearningcoach.com/elearning_design/isd/metalearning/
http://theelearningcoach.com/elearning_design/chunking-information/
http://marc-edwards.com/2013/01/6-steps-to-learn-master-anything/
http://domenicodefelice.blogspot.it/2013/06/best-ways-to-learn-foreign-language.html
http://ankisrs.net/ - Remembering things: Spaced repetition software

Javascript Bookmark Scripts

2012-12-05T00:00:00+00:00

I often use scripts that inject javascript and modify the page I’m currently looking at, mostly for inspecting/debugging reasons; for example WhatFont, which reveals what fonts the page is using. These scripts can be triggered from a browser bookmark, so that they run with just a click (very convenient).

Pretty.js

Here’s an example. This script finds programming code placed on a web page and makes it pretty, using the google prettify syntax highlighting javascript lib.

// A. Add as browser bookmark:
// javascript:(function(){var d=document,s=d.createElement('scr'+'ipt'),b=d.body,l=d.location;s.setAttribute('src','https://raw.github.com/gist/4213877/pretty.js');b.appendChild(s)})();

// B. Loads an external javascript file, then runs the callback
// note that the callback function needs to be defined before calling loadJS (or passed inline)
// loadJS("http://code.jquery.com/jquery-latest.js", function() { $('my_element').hide(); });
var loadJS = function (url, callback) {
  var head = document.getElementsByTagName("head")[0];
  var script = document.createElement("script");
  script.src = url;

  // Attach handlers for all browsers
  var done = false;
  script.onload = script.onreadystatechange = function() {
    if( !done && ( !this.readyState             || 
                    this.readyState == "loaded" || 
                    this.readyState == "complete")) {
      done = true;

      // Continue your code
      callback();

      // Handle memory leak in IE
      script.onload = script.onreadystatechange = null;
      head.removeChild( script );
    }
  };

  head.appendChild(script);
}

// C. main function to run after js files are loaded
var run = function () {
  // include prettify.css
  $('head').append('<link rel="stylesheet" href="http://al3xandr3.github.com/css/prettify.css" type="text/css" />');
  // find all elements with pre (typically where code is)
  $('pre').addClass("prettyprint");
  // make pretty !!
  prettyPrint();
}

// D. load jquery and take it from there
loadJS('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function () {
  //when finished loading jquery load prettify.js, then run
  loadJS('http://al3xandr3.github.com/js/libs/prettify.js', run);
});

How it works:

The browser bookmark code (in A.) loads pretty.js, a script hosted as a (github) gist.
The pretty.js script itself loads the external jquery.js and prettify.js libs (in D. by using B.)
When the external libs are fully loaded, it runs a bit of code that makes every pre html tag (typically where code exists) pretty (syntax highlighted). (in C.)

Pretty.js is hosted here: https://gist.github.com/4213877

Notes:

A. The code to be used in the URL field of the browser bookmark. Note that depending on the url you use from gist, you can choose to share a specific version or the latest one.
B. Is a generic function to load external javascript files; it waits until the external script is fully loaded and then runs a callback function (run in this case). See how it is used in D. (loads 2 external js files in sequence and then calls run).
C. Is the function to run when the external scripts are loaded on the page.

I have also put together javascript growl-type alerts as a gist, to be used in bookmarks like these: see here

I imagine this is a very common pattern for this type of script, so feel free to use, re-use, modify.

Kudos for the github:gist service! (it even includes versioning !!)

Note

Here’s a nice collection of ready made bookmarklets: https://github.com/yaph/bookmarklets

Data Processing Techniques

2012-11-09T00:00:00+00:00

Reference of data processing techniques.

filter

Exclude lines from file.csv that contain a “.” and output the result into file2.csv

$ cat file.csv | grep -v '\.' > file2.csv

Useful if you need to filter a big text file (one that text editors aren’t able to handle).

split

Split file.csv into files of 1000 lines each

$ split -l 1000 file.csv

Useful if you need to open a file in excel that is bigger than excel allows. Split it into multiple files.

count lines

$ cat file.txt | wc -l

show the first (and last) lines of a big file

Show the first 5 lines of file.txt

$ head -n 5 file.txt

Show the last 5 lines of a file

$ tail -n 5 file.txt

R <-> Excel via clipboard

Especially made to interact with Excel (adapt for other apps). Note that readClipboard and writeClipboard are only available in R on Windows. Inside R do:

### Column
# read from clipboard a column of strings into array x
x = readClipboard()

# read from console input a column of numbers into array x
x = scan()

# send a column of strings in x to the clipboard
writeClipboard( as.character(x) )


### Table
# read from clipboard into table x
x = read.DIF(file='clipboard', header=TRUE, transpose=TRUE)

# write table x to clipboard
write.table(x, "clipboard", sep="\t", row.names=FALSE, col.names=TRUE)

reference: johndcook r_excel_clipboard

(ruby) Script - For each line in file Do (something)

Using ruby gives the flexibility to do a lot:

# Print the first comma-separated field of each line, without double quotes
File.foreach(ARGV[0]) do |line|
  puts line.split(",")[0].to_s.delete('"')
end

regular expressions

Some might not work exactly as written, depending on the tool

^\n - lines starting with newline = empty lines
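As a quick check of the empty-line pattern, here is a small Ruby sketch that uses it to drop empty lines from a string. The sample text is made up; other tools (grep, editors) may treat `^` and `\n` differently, as noted above.

```ruby
text = "first\n\nsecond\n\n\nthird\n"

# In Ruby, ^ matches at the start of every line, so /^\n/ matches
# lines that consist of nothing but the newline character.
non_empty = text.lines.reject { |line| line =~ /^\n/ }

puts non_empty.join
```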

Excel: time hh:mm to minutes(or hours)

To minutes: Excel keeps the time as a fraction of a day between 0 and 1, and there are 1440 minutes (24 hours) in a day, so for minutes:

=1440*mod(A1,1)

If you are working with time from a single cell only, be a bit careful. A hh:mm format will indicate only the fractional part of the DATE stored in the cell. That is, 0.24 and 1.24 and 18,000.24 will all display the same. To make sure you are getting only the fractional part (if that is what you want) use mod.
Can also get the hours by doing: =24*mod(A1,1)

Output:

| before   | after  |
| 01:47:41 | 107.68 |
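
The arithmetic behind the 01:47:41 row can be checked with a short Ruby sketch. The helper names `serial_from_hms` and `minutes_of_day` and the day number 41000 are made up for the illustration; the only assumption about Excel is the one stated above, that a time is stored as a fraction of a day.

```ruby
# Excel stores a date-time as a serial number: the integer part is the
# day, the fractional part is the time of day (0.5 = noon).
def serial_from_hms(hours, minutes, seconds)
  (hours * 3600 + minutes * 60 + seconds) / 86_400.0
end

# Equivalent of =1440*MOD(A1,1): keep only the fraction, scale to minutes.
def minutes_of_day(serial)
  1440 * (serial % 1)
end

serial = 41_000 + serial_from_hms(1, 47, 41) # 41000 is an arbitrary day number
puts minutes_of_day(serial).round(2)
```

Taking the value modulo 1 is what makes the result independent of the date part.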

Convert text file encoding

Many text editors support doing this, but when the file is too big use instead the command line.

In windows powershell(from utf-8 to ascii)

gc -en utf8 in.txt | Out-File -en ascii out.txt

reference(for *nix machine also): best-way-to-convert-text-files-between-character-sets @stackoverflow

Excel: pivot table count distinct

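Insert a column and in cell C2 paste this formula, then fill it down. It gives 1 for the first occurrence of each combination of columns A and B and 0 for repeats, so summing the new column in the pivot table gives the distinct count: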

=IF(SUMPRODUCT(($A$2:$A2=A2)*($B$2:$B2=B2))>1,0,1)

Reference: http://stackoverflow.com/questions/11876238/simple-pivot-table-to-count-unique-values
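
What the helper column computes can be sketched in Ruby. The rows are made-up sample data standing in for columns A and B; the flag logic mirrors the formula: 1 for the first time a pair appears, 0 for later repeats.

```ruby
# Rows of [column A, column B]; made-up sample data.
rows = [
  ["north", "apples"],
  ["north", "apples"],
  ["north", "pears"],
  ["south", "apples"],
  ["north", "pears"]
]

# Same idea as the SUMPRODUCT formula: 1 for the first occurrence of an
# (A, B) pair, 0 for every later repeat.
seen = {}
flags = rows.map do |pair|
  first_time = !seen.key?(pair)
  seen[pair] = true
  first_time ? 1 : 0
end

p flags
puts flags.sum
```

Summing the flags gives the number of distinct pairs, which is what the pivot table does when it sums the helper column.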

Excel: Finding the last filled row, for a given column

Useful for automating excel reports: imagine you keep adding rows to an existing table to add more data; often a report needs to calculate a value considering all existing rows.

A way to do this, for column B:

=MAX((B1:B1000<>"")*(ROW(B1:B1000)))

Looks at rows 1 up to 1000 and finds the last filled row. Does an array calculation, from 1 to 1000: calculates a vector of true or false (whether each cell is filled or not) and multiplies it by another vector that contains the row numbers, obtaining a vector with the row numbers of the filled cells. Then gets the Max out of that.

Enter this formula and then press: Shift + Ctrl + Enter. This last step is required because this is an array calculation, instead of a cell calculation (the most common).
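
The array calculation can be followed step by step in a small Ruby sketch. The column contents are made up, and empty cells are represented as empty strings; this only illustrates the logic, it is not how Excel evaluates the formula internally.

```ruby
# Column B as an array; index 0 is row 1. Empty cells are empty strings.
column_b = ["Date", "2014-01-01", "", "2014-01-03", "", ""]

# (B1:B1000<>"") becomes 1 or 0 per cell, ROW(...) is the row number;
# their product is the row number for filled cells and 0 otherwise.
products = column_b.each_with_index.map do |cell, index|
  filled = cell != "" ? 1 : 0
  filled * (index + 1)
end

last_filled_row = products.max
p products
puts last_filled_row
```

Note that gaps in the middle of the column do not matter: the maximum picks the last filled row even if earlier cells are empty.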

Excel: value of a cell given a text coordinate

=INDIRECT("E"&C7)

Returns the value in cell Ex, where x is the value inside cell C7. Useful when calculating dynamic ranges, for automatic updating reports.

Excel: sum the last 30 rows for a given column (dynamically, when data is added to bottom)

Find the Index of the last row not empty:

=MAX((B1:B1000<>"")*(ROW(B1:B1000)))

place in cell Q2 (for the calculation below)

For column B, sum the last 30 rows

=SUM(INDIRECT("B"&(Q2-29)&":B"&Q2))
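
The range arithmetic is easy to get off by one: a window of 30 rows that ends at row r starts at row r - 29. A minimal Ruby sketch of the same calculation, where `sum_last_rows` is a hypothetical helper and the column data is made up:

```ruby
# Sum the last `count` rows of a column, given the 1-based number of
# the last filled row (what the MAX(...) formula returns).
def sum_last_rows(column, last_row, count)
  first_row = [last_row - count + 1, 1].max # never start before row 1
  column[(first_row - 1)..(last_row - 1)].sum
end

column = (1..100).to_a # made-up data: row n holds the value n
puts sum_last_rows(column, 100, 30)
```

With this data the window covers rows 71 to 100, which is exactly 30 rows.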

Excel: Using dynamic ranges for charts/ pivot tables, etc..

Great for automating Excel Reports.

See here: http://www.analyticsvidhya.com/blog/2014/09/automate-excel-models-reporting-dynamic-range/

Create named dynamic ranges. ex:

  =INDIRECT("Steps!$D$12:$D$"&INDIRECT("Steps!$C$8"))

Reference them in charts, for example. (press F3 to list them)

Excel: Pivot Tables

Renaming a field in a pivot table is not allowed when the name is taken (“PivotTable field name already exists”): add a space at the start (or end). ex: “Sum of MyField” -> “ MyField”

Pivot Table field has a “(blank)” or some other thing you don’t want in the final report:

rename it to a space (“ “)
order the column so that spaces appear at the end or at the start

Excel: Better number formatting

Format a number to show thousands as (K) and millions as (M): 7.6M instead of 7,592,712.

[>=1000000] $#,##0.0,,"M";[<1000000] $#,##0.0,"K";General
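
For comparison, the same display rule can be sketched in Ruby. `short_money` is a hypothetical helper, and it is a simplification: unlike the Excel format it does not insert thousands separators in the scaled number.

```ruby
# Rough equivalent of the custom format: one decimal, scaled to
# millions ("M") from 1,000,000 up and to thousands ("K") below that.
def short_money(value)
  if value >= 1_000_000
    format("$%.1fM", value / 1_000_000.0)
  else
    format("$%.1fK", value / 1_000.0)
  end
end

puts short_money(7_592_712)
puts short_money(45_300)
```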

Download a site

  wget -m -p -E -k -K -np http://www.somesite.com/

“-m” mirror site.
“-p” download everything needed to display the page (images, css, etc…).
“-E” add the .html extension to html files saved without it.
“-k” convert the links in the document to make them suitable for local viewing.
“-K” When converting a file, back up the original version with a ‘.orig’ suffix.
“-np” Do not ever ascend to the parent directory when retrieving recursively.

References:

command line Unix like tools in windows can be found for example in: Git for Windows shell (bundles minGW)
Top 10 Unix Command Line Utilities 2012
http://datavu.blogspot.com/2014/08/useful-unix-commands-for-exploring-data.html

Confluence automation

2012-04-02T00:00:00+00:00

Here’s a script that I use to automate updating Confluence wiki content. Posting it here in case it is useful for someone else. I developed it on Windows 7 with ruby 1.9.3.

Can be used as a standalone, like so:

# for content:
$ ruby confluence.rb post "{html}<h1>Hello Confluence</h1>{html}" "page" "space" "username" "password" "confluence.my.com"

# for attachments:
$ ruby confluence.rb attach file.pdf "application/pdf" "page" "space" "username" "password" "confluence.my.com"

Or as a ruby lib, like so:

require 'confluence'

# for content:
Confluence.post "{html}<h1>Hello Confluence</h1>{html}", "page", "space", "user", "pass", "confluence.my.com"

# for attachments:
Confluence.attach "report.pdf", "application/pdf", "page", "space", "user", "pass", "confluence.my.com"

Code

module Confluence
  require 'xmlrpc/client'
  extend self

  # Confluence.post "{html}<h1>Hello Confluence</h1>{html}", "page", "space", "user", "pass", "confluence.my.com"
  def post content, page, space, user, pass, server
    confluence = XMLRPC::Client.new2("https://#{user}:#{pass}@#{server}/rpc/xmlrpc")
    # disable certificate check    
    confluence.instance_variable_get(:@http).instance_variable_set(:@verify_mode, OpenSSL::SSL::VERIFY_NONE)
    # shortcut to the remote API
    confluence = confluence.proxy("confluence2") # use "confluence1" with older Confluence versions

    pa = confluence.getPage("", space, page)
    pa["content"] = content
    confluence.storePage("", pa)
  end

  # Confluence.attach "report.pdf", "application/pdf", "page", "space", "user", "pass", "confluence.my.com"
  def attach file, type, page, space, user, pass, server
    server = XMLRPC::Client.new2("https://#{user}:#{pass}@#{server}/rpc/xmlrpc")
    # disable certificate check    
    server.instance_variable_get(:@http).instance_variable_set(:@verify_mode, OpenSSL::SSL::VERIFY_NONE)
    # shortcut to the remote API
    confluence = server.proxy("confluence2") # use "confluence1" with older Confluence versions

    token = confluence.login(user, pass)
    pa = confluence.getPage(token, space, page)

    attachment = {}
    attachment['fileName'] = file
    attachment['contentType'] = type

    # Read the updated local copy of the document as raw bytes (binary mode)
    data = File.binread(file)

    confluence.addAttachment(token, pa['id'], attachment, XMLRPC::Base64.new(data))
  end 
end


if __FILE__ == $0
  if ARGV[0] == "post"
    if ARGV.size != 7
      puts '$ ruby confluence.rb post "{html}<h1>Hello Confluence</h1>{html}" "page" "space" "username" "password" "confluence.my.com"'
    else
      Confluence.post ARGV[1], ARGV[2], ARGV[3], ARGV[4], ARGV[5], ARGV[6]
    end
  end

  if ARGV[0] == "attach"
    if ARGV.size != 8
      puts '$ ruby confluence.rb attach file.pdf "application/pdf" "page" "space" "username" "password" "confluence.my.com"'
    else
      Confluence.attach ARGV[1], ARGV[2], ARGV[3], ARGV[4], ARGV[5], ARGV[6], ARGV[7]
    end
  end    
end

It might change/evolve over time, check it out in my github ruby repo.

How to get into the Semantic Web

2012-01-11T00:00:00+00:00

Practical examples on how to get onto the semantic web and on using it.

Getting in There

Start by creating a semantic web personal online profile using the Friend-Of-A-Friend (FOAF) vocabulary. FOAF has become the standard for personal profiles on the semantic web and, as the name implies, it also lets you link to people you know.

I used the foaf-a-matic online tool and then uploaded the results to my site’s foaf.rdf. Easy.

With this, we can already use the semantic web query language, called SPARQL, to ask what it knows about me:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?property ?value
FROM <http://al3xandr3.github.com/foaf.rdf>
WHERE { 
  ?me foaf:name "Alexandre Matos Martins" .
  ?me ?property ?value .
}

run on sparql.org →

Why Another Online Profile?

How many times have you filled in your personal profile information on web sites? Google+, Facebook, YouTube, Yahoo!, MSN, Blogspot, Amazon, Twitter, LinkedIn, Flickr, Tumblr, Ebay, mySpace, hi5!, etc… How many more times will we need to do it again?

The semantic web is about sharing data in an agreed-upon format, so that the data can be easily linked to and (re)used. Thus, once my profile is on the semantic web, any new site that I sign up for can just read in this data instead of asking me to fill it in.

Sharing data, in an agreed upon format, is an incentive for re-use and disincentive for wasteful duplication - #semanticweb

Adding the Site

The next step is to add the web site to the semantic web.

Augmenting a web site's content with semantic data facilitates data sharing: web pages essentially become little standalone data repositories that semantic web tools can understand.

The way to do it is simple enough: add (invisible) html attributes to the existing web pages that specify the meaning (the semantics) of the html elements.

These extra html attributes, which add meaning to web pages, are defined in a W3C standard called RDFa. Quoting Wikipedia:

RDFa defines how to embed RDF subject-predicate-object expressions within XHTML documents, it also enables the extraction of RDF model triples by compliant user agents.

For example, in the About section of my site, we can define rdfa attributes saying that “Alexandre Matos Martins” is my name and that I am of type Person, plus a link to my foaf profile and my online accounts like Twitter and Skype:

Html:

<h2>About</h2>
<div about="http://al3xandr3.github.com/#me" typeof="foaf:Person" property="rdfs:seeAlso" content="http://al3xandr3.github.com/foaf.rdf">
  <a rel="foaf:OnlineAccount" href="http://twitter.com/al3xandr3">twitter</a>
  <a rel="foaf:skypeID" href="skype:al3x.martins?userinfo">skype</a>
</div>

With the rdfa attributes on the site, it is then possible to use generic semantic web tools to query the website data.

For example, to find the topics of a site (what the site is about), just augment the site's html list of topics with rdfa attributes, like so:

Html:

<a rel="dcterms:subject" href="/tags/SPARQL.html">SPARQL</a>
<a rel="dcterms:subject" href="/tags/data.html">data</a>
<a rel="dcterms:subject" href="/tags/abtesting.html">abtesting</a>

This allows us to query it:

PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?subject 
FROM <http://www.w3.org/2007/08/pyRdfa/extract?uri=http://al3xandr3.github.com/>
WHERE {
 <http://al3xandr3.github.com> ?predicate ?subject . 
 ?s dcterms:subject ?subject .
}

run on sparql.org →

RDFa augments web pages as standalone data repositories that #semanticweb can understand, doubling as normal web pages, is like web scraping done right

Using It

Why is all this data useful? For a more futuristic use case, check the Data, Data, Data! semantic web use case on Xmas.

In the meantime, we can already play around with more mundane things, for example predicting how likely it is that I will write a twitter quote on any given day.

I collected a few of my twitter quotes on the quotes page, and each quote has an rdfa date on it.

Html:

<blockquote about="/semanticweb.html#2011-12-21" property="dcterms:date" datatype="xsd:date" content="2011-12-21">
<p property='dcterms:description'>data,data,data! a #semanticweb use case on Xmas: <a href='http://al3xandr3.github.com/2011/12/18/data.html'>al3xandr3.github.com</a>
</p></blockquote>

<blockquote about="/semanticweb.html#2012-01-10" property="dcterms:date" datatype="xsd:date" content="2012-01-10">
<p property='dcterms:description'>Sharing data, in an agreed upon format, is an incentive for re-use and disincentive for wasteful duplication. #semanticweb
</p></blockquote>

etc...

So we can use the following sparql query to fetch, directly from the quotes page, the dates and how many quotes I’ve written on each date:

PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?date (count(?subject) AS ?total)
FROM <http://www.w3.org/2007/08/pyRdfa/extract?uri=http://al3xandr3.github.com/pages/quotes.html>
WHERE { 
  ?subject dcterms:date ?date .
}
GROUP BY ?date
ORDER BY ?date

run on sparql.org →

Then I look up the day of the week for each of those dates and sum up the number of quotes per day of the week.

Having this, I can calculate the probability (the expected value) for each day, and can then just look up the probability for any given day.
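The day-of-the-week step can be sketched in Ruby. This is only one plausible reading of the calculation (each weekday's share of all quotes), not the javascript that runs on this page. The `counts` hash holds made-up sample rows standing in for the (date, total) results of the sparql query, and `weekday_probabilities` is a hypothetical helper name.

```ruby
require 'date'

# Hypothetical sample of (date => number of quotes) rows, standing in for
# the result of the sparql query above. The values are made up.
counts = {
  "2011-12-21" => 1,   # a Wednesday
  "2011-12-22" => 2,   # a Thursday
  "2012-01-10" => 1    # a Tuesday
}

# Sum the quotes per day of the week, then divide by the grand total to get
# the share of quotes that falls on each weekday (assumed interpretation).
def weekday_probabilities(counts)
  per_day = Hash.new(0)
  counts.each do |date, total|
    per_day[Date.parse(date).strftime("%A")] += total
  end
  grand_total = per_day.values.sum.to_f
  per_day.transform_values { |n| n / grand_total }
end

probs = weekday_probabilities(counts)
puts probs["Thursday"]
```

With real data, looking up `probs[Date.today.strftime("%A")]` would give the estimate for today.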

For a full (a)live data experience, this is implemented in javascript that fetches the data when this page is opened.

I use jquery .ajax to fetch the results of the sparql query defined above, do some data manipulation, plot it using d3.js, and finally output the prediction.

Look at the source code of this page to see how it works.

Quotes per day of the week:

For example, on a Thursday it is 21% likely that I’ll tweet.


For Sale

2012-01-01T00:00:00+00:00

[SOLD] Behringer DD600 delay pedal

24-bit high-resolution
stereo delay/echo/panning
delay time up to 2sec
11 different modes

As new.

official site / youtube

original price: 40€ (new), http://www.thomann.de/de/behringer_dd600.htm
selling for: 29€ ( 23% discount )

[SOLD] Orange Tiny Terror

All valve, 15 watt portable guitar head. Featuring a unique two stage pre-amp which has a massive tonal range using just three controls. When driven, this little amp has almost as much gain as most four stage lead channels!

Bought in 2008, only used at home and not very often. Includes Gig bag, and still with original box.

official site / youtube

original price: 420€ (new)
selling for: 270€ ( 35% discount )

[SOLD] Line6 M9 Stomp Box

Bought in 2010, only used at home and not very often, almost as new

official site / youtube

original price: 385€ (new)
selling for: 250€ ( 35% discount )

[SOLD] Framus FR212 CB

2x 12” Celestion Vintage 30 speakers, 120 W mono into 8 Ohms, open or closed back via removable rear panel.

Bought in 2008, only used at home and not very often, almost as new

official site / youtube

original price: 409€ (new), http://www.thomann.de/de/framus_fr212_cb.htm
selling for: 215€ ( 47% discount )

[SOLD] Boss ME-30 multi-effects unit

Oldie, but great sounding unit, try it out!

youtube / manual

35€

[SOLD] Line6 POD2.0

The industry standard for direct recording in the studio

official site / youtube

60€

[SOLD] Tapco MIX 100 Ultra-Compact Mixer

Bought in 2008, rarely used. 10-channel Compact Mixer with 2 Mono Mic/Line Inputs, 4 Stereo Inputs, and 1 Aux Send. 2x 48v Phantom Power for professional-level condenser microphones.

original price: 58€ (new)
selling for: 25€ ( 40% discount )

5-Port 10/100Mbps Ethernet Switch

Not used, still in original box

official site

10€

Yamaha Short Cymbal Boom/Arm

Model: CH750

30€

Audio Cables

2x Guitar Cable, Stereo, 1/4” plug, 2m - 6€
2x Guitar Cable, Stereo, 1/4” plug, 4m - 9€

Contact me:

mail: al3xandr3@gmail.com
skype: al3x.martins

Data, Data, Data!

2011-12-18T00:00:00+00:00

Once Upon a Time…

After a big meal at Xmas

computer: Given your heart condition and age, your risk of heart attack has now increased from 1% to 5% and, continuing at this rate, is estimated to increase to 10% in 2 weeks. Your family doctor has been notified.

person: f*k, why, how?

computer: Excessive ingestion of foods with fat and cholesterol leads to increased risks. Your meals over the past weeks have contributed to that. It was about the same last year during the Xmas season. And it is quite common across the world for most people.

person: what now?

computer: I recommend better nutrition. I estimate that an ideal diet with exercise can reduce the risk to close to 1% in about 2-3 weeks. Should I create a recovery plan?

person: yes

computer: and order food from local shop?

person: yes

computer: and book gym time?

person: oh god, have to go to gym? I hate gym

computer: For faster recovery, physical activity is recommended. Without it, I estimate recovery will take longer. But if you prefer, in the next few weeks there will be a wilderness day and a walking activity in your area.

person: Yes, let's do that instead.

computer: ok, you are signed up for them. I will also be guiding your meals and sport activity for the next weeks.

person: thanks

A few weeks after at the workplace

computer: Your heart attack risk has decreased to close to 1%. The recovery plan has been successful; you are now back to normal healthy levels.

person: Whohoo!

computer: Do note that your sugar levels are presently low, so your intellectual productivity is estimated to be below 40%. I recommend you stop for the day and go get some food.

person: ok thanks, but one last thing, you mentioned that the increased health risks around Xmas season is a common problem ?

computer: yes, a world wide problem in fact. Do you want to get the full report?

person: No. just wondering if we can do something about it… Can you run some simulations?

computer: yes

person: For example, what are the impacts of a campaign on tv about the health risks just before the Xmas season…

computer: Simulation estimates that it would improve global health 2% and reduce global hospital costs about 3% for the next year.

person: How about a Santa visit to schools including a nutrition lecture? Or school teachers nutrition training? Or a new tv cartoon themed around healthy food? Or a reduction on healthy foods tax for this season? Or an increase in unhealthy food tax for this season? Or any mix of these… Run the simulations on the impacts.

computer: done.

person: ok, please generate an action plan and forward it to all members of the assembly for voting.

computer: will do Governor

all lived happily ever after… the End

Data, Data, Data!

See where this is going? Data, data, data! It contributes to improving our lives in health, government, business, education, etc…

In the truly data driven future:

All data will be collected automatically by the devices around us; all new data will be automatically linked with existing data and made available to be used by everyone, everywhere.

Many data aware systems will exist and constantly improve themselves: correlating, mixing and matching data, inferring and generating new knowledge from the newly generated data that keeps on springing up all around us.

There will be data available about each person, each government, each household, each company, each car, each food, each animal, each school, each computer… essentially everything, and it will be collected all the time (almost in real time) by all kinds of gadgets.

With this information we will be able to have computers that can quickly see where a specific trend is going and suggest real time actions to ensure optimal effects.

Privacy

All this personal data floating around might sound scary: what if it gets used in the wrong way? Guess what… data aware systems everywhere will themselves prevent fraud and misuse of information: any kind of bias or discrimination is immediately picked up and identified.

I’d argue further that this won’t actually happen, because in a very data driven education any wrong influences are quickly identified and corrected early in the education process. Thus the tendency is that among grown-ups the number of wrongdoers will be very small.

How do we go about it? To start with, the data sharing part is key and is currently under-developed (or just not widespread yet). The usual approach nowadays is to collect everything into one gigantic private silo and then use that.

We need to get to the point where all new generated data is available, linked and automatically understood by all other data systems, so we need a lingua franca for data formats and data sharing that everybody talks and understands.

So, as a practical 1st step, introducing the “Semantic Web” by Tim Berners-Lee:

A web of data that can be processed directly and indirectly by machines. @wikipedia

The Semantic Web will make this (data sharing) possible, by providing an open format for the representation and exchange of knowledge and expertise. @lifeboat

The Semantic Web changes the economics of processing knowledge. @lifeboat

A start. And in fact, the image at the top of this post is a diagram of the already existing semantic webs. @lod_cloud

Table.query

2011-10-14T00:00:00+00:00

A small library that, given a set of data, transparently inserts it into a sqlite database and then allows for very flexible querying (SQL). Useful for ad-hoc data tasks that read data from some file or api and require quick analysis. Adding the data to a sqlite database also allows the data to be bigger than what fits in memory. Inspired by R’s sqldf.

Use case: parse data from a log file or a web service, then do some data manipulation and summaries like joining with other data, filtering, pivoting (group by), augmenting data with calculated columns, calculating sums, counts, averages, etc…

Leverages the power of sql for data analysis inside ruby with a minimal API:

> require 'table'
> Table.new ["user", "value"], [["bob", 3], ["eve", 1]], "tbl"
> Table.query "select sum(value) from tbl"
[[4]]

Features:

automatically infers the data type (numeric vs text)

shortcut to get a column

 > tbl = Table.new ["user", "value"], [["bob", 3], ["eve", 1]], "tbl"
 > tbl.user
 ["bob", "eve"]

augment your data by adding columns:

 > tbl = Table.new ["user", "value"], [["bob", 3], ["eve", 1]], "tbl"
 > tbl.add "sex", ["male", "female"]
 > Table.query "select user, sex from tbl"
 [["bob", "male"], ["eve", "female"]]

naming the query (the last argument) creates a new table:

 > Table.new ["user", "value"], [["bob", 3], ["eve", 1]], "tbl"
 > tbl2 = Table.query "select sum(value) as total from tbl", "tbl2"
 > tbl2.total
 [4]

direct access to db driver when needed:

 > Table.new ["user", "value"], [["bob", 3], ["eve", 1]], "tbl"
 > Table.with_db {|db| db.execute("update tbl set value=5 where user='eve'") }
 > Table.query "select sum(value) from tbl"
 [[8]]

persists in the file table.db, which means the data is also accessible by other tools: R, Excel, Java, etc…
small :)
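The type inference listed above (numeric vs text) can be sketched on its own in plain Ruby, without sqlite. This is a minimal sketch of the idea only: `infer_type` is a hypothetical helper name, and it checks for any `Numeric` value, which is slightly broader than the integer-or-float check in the library code.

```ruby
# Guess a sqlite column type from the first non-empty value in a column.
# `infer_type` is a hypothetical name used only for this sketch.
def infer_type(rows, col_index)
  first = rows.map { |row| row[col_index] }
              .find { |v| !v.nil? && v != "" }
  # A column with no usable value falls back to TEXT.
  first.is_a?(Numeric) ? "NUMERIC" : "TEXT"
end

rows = [[nil, "bob"], [3, "eve"]]
puts infer_type(rows, 0)
puts infer_type(rows, 1)
```

Skipping nil and empty values means a gap in the first row does not force the whole column to TEXT.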

Code

require 'sqlite3'

class Table
  attr_accessor :name

  def self.with_db(&block)
    db = SQLite3::Database.new("./table.db")
    yield db if block_given?
  ensure
    # close the connection even when the block returns early
    db.close if db
  end

  def self.query q, new_table_name=nil
    if new_table_name
      val = []
      Table.with_db {|db| val = db.execute2(q) }
      Table.new val.shift, val, new_table_name
    else 
      Table.with_db {|db| return db.execute(q) }
    end
  end  

  # new
  def initialize header, data, name
    @table = name

    # sql to create new table
    sql = "create table #{@table} ("
    header.each_with_index do |h, i|

      # iterates each row, looking for a non empty value
      val = ""
      data.each do |row| 
        if row[i] != nil and row[i] != ""
          val = row[i]
          break
        end
      end

      if val.is_a?(Integer) or val.is_a?(Float) # Fixnum was removed in Ruby 3.2
        sql << "#{h} NUMERIC,"
      else 
        sql << "#{h} TEXT,"
      end
    end
    sql << ");"

    Table.with_db do |db|
      db.execute( "drop table if exists #{@table};" ) # remove if exists
      db.execute(sql.gsub(",);", ");"))               # create new table  
      data.each do |row|                              # insert data
        db.execute( "insert into #{@table} values ( '#{row.join("','")}' );" )        
      end
    end
  end

  # column  
  def method_missing(m, *args, &block) 
    Table.with_db do |db|
      return db.execute( "select #{m} from #{@table}" ).flatten
    end
  end

  # add  
  def add col, data
    Table.with_db do |db|
      if data[0].is_a?(Integer) or data[0].is_a?(Float) # Fixnum was removed in Ruby 3.2
        db.execute( "alter table #{@table} add #{col} NUMERIC;" )
      else
        db.execute( "alter table #{@table} add #{col} TEXT;" )
      end

      # add specified value to each row
      data.each_with_index do |val, i|
        db.execute( "update #{@table} set #{col} = '#{val}' where ROWID = #{i+1};" )
      end
    end
  end

  # list tables
  def list
    Table.with_db do |db|
      # use the block's db handle (@db is never set)
      return db
        .execute("select name from sqlite_master where type='table' ORDER BY name;")
        .flatten
    end
  end
end




if __FILE__ == $0
  require "test/unit"
  class TestTable < Test::Unit::TestCase

    def setup
      require 'date' # Date is used by test_date
      # File.exists? was removed in Ruby 3.2; File.exist? works everywhere
      File.delete "./table.db" if File.exist? "./table.db"
    end

    def test_insert
      Table.new ["id", "v1", "v2"], [[1, 23, "a"], [2, 34, "b"]], "test"
      assert_equal [[23.0, "a"], [34.0, "b"]], Table.query("select v1,v2 from test")
    end

    def test_method_missing
      tbl = Table.new ["id", "col1", "col2"], [[1, 23, "a"], [2, 34, "b"]], "test"
      assert_equal [23.0, 34.0], tbl.col1
    end

    def test_add
      tbl = Table.new ["id", "col1"], [[1, 23], [2, 34]], "test"
      tbl.add("col2", tbl.col1.map{|v|v+1} ) # v1+=1 as v2
      assert_equal [24.0, 35.0], tbl.col2
      tbl.add("col3", ["random", "stuff"])
      assert_equal ["random", "stuff"], tbl.col3
    end

    def test_join
      Table.new ["id", "v1", "v2"], [[1, 23, "a"], [2, 34, "b"]], "tbl1"
      Table.new ["id", "v3", "v4"], [[1, 24, "c"], [2, 36, "d"]], "tbl2"
      sql = "select tbl1.id,v1,v4 from tbl1 join tbl2 on tbl1.id = tbl2.id"
      assert_equal [[1, 23, "c"], [2, 34, "d"]], Table.query(sql)
    end

    def test_alias
      Table.new ["g", "v"], [["a",11], ["a",9], ["b",2], ["b",2]], "tbl"
      tbl2 = Table.query("select g, sum(v) as value from tbl group by g", "tbl2")
      assert_equal ["a", "b"], tbl2.g
      assert_equal [20.0, 4.0], tbl2.value
    end

    def test_join_with_new_table
      Table.new ["id", "v1", "v2"], [[1, 11, "a"], [2, 12, "b"]], "tbl1"
      Table.new ["id", "v3", "v4"], [[1, 21, "c"], [2, 22, "d"]], "tbl2"
      sql = "select tbl1.id,v1,v4 from tbl1 join tbl2 on tbl1.id = tbl2.id"
      Table.query(sql, "tbl3")
      assert_equal [[1, 11, "c"], [2, 12, "d"]], Table.query("select * from tbl3")
    end

    def test_ad_hoc
      Table.new ["id", "v1", "v2"], [[1, 23, "a"], [2, 34, "b"]], "test"
      val = nil
      Table.with_db {|db| val = db.execute("select v1 from test limit 1") }
      assert_equal [23.0], val.flatten
    end

    def test_date
      Table.new ["ts", "v"], [[Date.today, 23], [Date.today+10, 34]], "test"
      sql = "select v from test where ts < '#{Date.today+2}'"
      assert_equal [23.0], Table.query(sql).flatten
    end

    def test_time
      Table.new ["ts", "v"], [[Time.now, 10], [Time.now+10, 20]], "test"
      sql = "select v from test where ts < '#{Time.now+2}'"
      assert_equal [10.0], Table.query(sql).flatten
    end

  end
end

Monitoring Productivity II - the Others

2011-09-30T00:00:00+00:00

In the previous Monitoring Productivity Experiment post I looked into the hours I spend on the computer; now I look into the hours others spend on the computer, which is far more interesting :) The goal is to find things like which day people spend the most time on the computer, how many hours they work, and general activity patterns.

Collecting data

On osx, it is possible to use growl to display a message when a skype user signs in. So I configured growl to log the sign-ins and sign-outs of my skype contacts.

Like so:

> touch ~/Desktop/growl.log
> defaults write com.Growl.GrowlHelperApp GrowlLoggingEnabled -bool YES
> defaults write com.Growl.GrowlHelperApp GrowlLogType 1
> defaults write com.Growl.GrowlHelperApp "Custom log 1" ~/Desktop/growl.log

Instructions here.

And then I left my skype signed in for a few weeks while I was on vacation.

Parsing data

Read the log file and create a semicolon separated file:

#!/usr/bin/env ruby
puts "timestamp;user;status"
File.open(ARGV[0]).each_line do |l|
  if l.include? "online" or l.include? "offline"
    date  = l.split('Skype')[0].strip
    user  = l.scan(/Skype:([^\(]*)/)[0][0].strip
    status = l.include?("online") ? "online" : "offline"
    puts "#{date};#{user};#{status}"
  end
end

Load it into R

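library(sqldf)    # sqldf() is used for the SQL queries below
library(ggplot2)  # ggplot() is used for the plots below

data = read.csv("/my/proj/skype-growl/log.csv", sep=";", header=TRUE)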

# parse dates "Aug 24, 2011 3:58:01 PM"
data$date = as.POSIXct(strptime(data$timestamp,"%b %d, %Y %I:%M:%S %p")) # DateTime
data$hour = format(data$date, format="%H:%M:%S")       # string
data$time = as.POSIXct(data$hour, format = "%H:%M:%S") # DateTime
data$day  = format(data$date, format="%m/%d/%y")       # string
data$weekday = format(data$date, format="%A")          # string

# filter for complete days of data
data = sqldf("select * from data where day >= '08/25/2011' and day <= '09/21/2011'")
sqldf("select count(distinct(day)) from data")

27 days of data.

The sign-in’s and sign-out’s of a random person

randomperson = sqldf("select user from data group by random() limit 1")

d = sqldf(sprintf("select * 
                   from data 
                   where user = '%s' and day  >= '09/04/2011' and 
                   day  <= '09/12/2011'", randomperson[1,1]))

ggplot(data=d, aes(y=time, x=date)) + geom_point(aes(color=status), alpha=0.6) +  scale_x_datetime(date_breaks = "1 day") + scale_y_datetime(date_breaks = "1 hour")

10-Sep is a Saturday and 11-Sep is a Sunday, which means skype was off on the weekend
start of workday between 9h-11h
end of workday between 18h-19h
skype is offline after working hours, except on night of Monday 05-Sep

Online Activity Patterns

Plotting all sign-in’s and sign-out’s over each weekday we can get a feeling for overall online activity:

ggplot(data, aes(x=time,..density..)) + geom_histogram() + facet_grid(weekday ~ .)

Night time has less activity, and it gets progressively lower as the night goes by
Around 9am activity spikes (people start work?)
Around 17h/18h activity spikes (people ending work?)
The 9h & 17h spikes are not so well formed on the weekend, thus very likely connected to work
On Sundays after dinner time (or so) people seem to get online again before going to sleep
On weekends the computer gets more use later in the day

How many hours do people work?

This is trickier to measure accurately, but we can make a guess:

assuming that people are working during working hours of workdays
assuming that nobody starts work before 6am, and nobody ends work after 9pm

Then the first activity after 6am is the start of work, and the last activity change before 9pm is the end of work.

d = sqldf("select user, 
                  day, 
                  weekday,
                  min(hour) as start, 
                  max(hour) as end
           from data
           where hour >= '06:00:00' and hour <= '21:00:00' and
                 weekday <> 'Saturday' and weekday <> 'Sunday'
           group by user, day")
d$totalhours = difftime(as.POSIXct(d$end, format = "%H:%M:%S"), as.POSIXct(d$start, format = "%H:%M:%S"))
d$totalhours = as.numeric(d$totalhours, units="hours")

# exclude less than 2 hours/day, means bots, vacations, etc...
dt = sqldf("select * from d where totalhours > 2")

al3x.load() # my own collection of R functions
al3x.hist(dt, "totalhours")

Workday total hours are mostly between 6 and 12 hours, the most common being 8.5 hours/day.
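The same first-to-last-activity rule can be sketched in plain Ruby, which makes the two assumptions (nothing before 6am, nothing after 9pm) explicit. This is a minimal sketch and not the R analysis used for the plot: the `events` rows are made up, and `seconds` and `work_hours` are hypothetical helper names.

```ruby
# Hypothetical (user, day, hour) status-change events; values are made up.
events = [
  ["ann", "09/05/11", "05:10:00"],  # before 06:00, ignored
  ["ann", "09/05/11", "09:15:00"],
  ["ann", "09/05/11", "13:02:00"],
  ["ann", "09/05/11", "18:45:00"],
  ["ann", "09/05/11", "23:30:00"]   # after 21:00, ignored
]

# Convert "HH:MM:SS" into seconds since midnight.
def seconds(hms)
  h, m, s = hms.split(":").map(&:to_i)
  h * 3600 + m * 60 + s
end

# Hours between the first and last event inside the assumed working window,
# per user and day. Fixed-width "HH:MM:SS" strings compare correctly as text.
def work_hours(events, from = "06:00:00", to = "21:00:00")
  in_window = events.select { |_, _, hour| hour >= from && hour <= to }
  in_window.group_by { |user, day, _| [user, day] }.transform_values do |rows|
    hours = rows.map { |_, _, hour| hour }
    (seconds(hours.max) - seconds(hours.min)) / 3600.0
  end
end

result = work_hours(events)
puts result[["ann", "09/05/11"]]
```

A single event in the window yields 0 hours, which is why the analysis above drops days with less than 2 hours.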

On which day do people spend more time on the computer?

We can try counting the number of sign-in/sign-out changes per day: more changes suggest people are more likely to be on the computer.

d = sqldf("select weekday, count(status) as amount
           from data
           group by weekday
           order by count(status) DESC")
ggplot(d, aes(x=weekday,y=amount)) + geom_bar(stat="identity")

As the above could be biased in a number of ways, let's use another way to measure it; if the results match, then the original estimate should be ok.

For example, one way to go about it is to sum up the total working hours for each day:

d = sqldf("select user, 
                  day, 
                  weekday,
                  min(hour) as start, 
                  max(hour) as end
           from data
           where hour >= '06:00:00' and hour <= '21:00:00'
           group by user, day")
d$totalhours = difftime(as.POSIXct(d$end, format = "%H:%M:%S"), as.POSIXct(d$start, format = "%H:%M:%S"))
d$totalhours = as.numeric(d$totalhours, units="hours")

sqldf("select weekday, sum(totalhours) as amount
           from d
           group by weekday
           order by sum(totalhours) DESC")

Getting:

    weekday    amount
1   Tuesday 15404.471
2 Wednesday 15191.946
3    Monday 14298.472
4    Friday 12426.091
5  Thursday 11638.443
6  Saturday  5222.874
7    Sunday  5198.367

Almost the same results, great.

Thus Tuesday is the day people spend the most time on the computer, and in decreasing order:

Tuesday > Wednesday > Monday > Friday > Thursday > (Saturday or Sunday)

In productivity terms, this means that Tuesday is the most productive day, while Thursday is the least productive workday.

Dashboarding

2011-05-24T00:00:00+00:00

An important part of being data driven is to have daily feedback on data. Here are a couple of automated dashboards I’ve built recently:

In these, most of the data is displayed as is. A next iteration could enrich the data further with:

Data from a year/6 months ago, for direct comparison.
Fits to the data, like a regression line that shows the overall tendency and allows predictions of next day/week/month values.
More relative change plots, like the protovis index-chart, which are very useful.
Confidence intervals, pointing out which changes are unlikely to be by chance.
etc…
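The regression line idea from the list above can be sketched in a few lines of Ruby. This is a minimal sketch of an ordinary least-squares fit over daily values, assuming one value per day with no gaps; it is not part of the dashboard code, `linear_fit` is a hypothetical helper name, and the sample values are made up.

```ruby
# Ordinary least-squares fit of y = intercept + slope * x, where x is the
# day index 0, 1, 2, ... `linear_fit` is a hypothetical helper name.
def linear_fit(values)
  n = values.size.to_f
  xs = (0...values.size).to_a
  mean_x = xs.sum / n
  mean_y = values.sum / n
  covariance = xs.zip(values).sum { |x, y| (x - mean_x) * (y - mean_y) }
  variance = xs.sum { |x| (x - mean_x)**2 }
  slope = covariance / variance
  [mean_y - slope * mean_x, slope]
end

values = [10.0, 12.0, 14.0, 16.0]   # made-up daily values
intercept, slope = linear_fit(values)

# Extrapolate one day past the last observation.
next_day = intercept + slope * values.size
puts next_day
```

The slope gives the overall tendency per day, and evaluating the line at the next index gives a naive next-day prediction.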

Tools & Code

Coded in ruby, it aggregates data from different sources and, based on an html template, generates html with the full dashboard. It's fully automated, and I run it daily using a cron job.

It uses highcharts as the javascript charting engine, which I can only say good things about: very nice looking and allows user interaction.

I placed the code I use as the base to build the dashboards on github, find it here: https://github.com/al3xandr3/Dashboard

A few bits:

Getting RescueTime Data

require 'open-uri'
require 'date'

key = "yourownkey"

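res = Hash.new(0) # missing keys start at 0, so += works
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs
URI.open("https://www.rescuetime.com/anapi/data?key=#{key}&perspective=interval&format=csv&resolution_time=day&restrict_kind=activity") do |f|
  f.each_line.with_index do |l, i|
    next if i == 0 # skip the csv header
    t, sec, some, app, cat, prod = l.split(",")
    res[:week] += sec.to_i
    # compare full dates, not just the day of the month
    res[:day] += sec.to_i if Date.parse(t) == Date.today
  end
end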

print res

Getting Google Spreadsheets Data

require 'gdata/client'  
require 'gdata/http'  
require 'gdata/auth'
require 'open-uri'
require 'date'

client = GData::Client::Spreadsheets.new
client.clientlogin('yourmail@gmail.com', "yourpass")
key = "yourspreadsheetkey"
test = client.get("http://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=#{key}&fmcmd&exportFormat=csv")
values = []
i=0
test.body.each_line do |l|
  t,w,co,wa,h = l.gsub("\n","").split(',')
  unless i==0
    values << [Date.parse(t), w.to_f, wa.to_f, h.to_f] 
  end
  i+=1
end

print values

Getting imap mail attachments

require 'net/imap'
require 'date'

opts = {} # set any option here to override the defaults below
opts[:inbox]   ||= "Inbox"
opts[:search]  ||= ["SINCE", "8-Aug-2007"]
opts[:attach]  ||= ["CSV"]
opts[:savedir] ||= "."

imap = Net::IMAP.new('mail.server.com', :port => 993, :ssl => true)
imap.login('yourmail@server.com', 'yourpassw')    
imap.select(opts[:inbox])
imap.search(opts[:search]).each do |uid|
  msg = imap.fetch(uid, ["ENVELOPE","UID","BODY"])[0]
  body = msg.attr["BODY"]
  date = Date.parse(msg.attr["ENVELOPE"].date)
  i = 1
  while body.parts[i] != nil
    type = body.parts[i].subtype
    encoding = body.parts[i].encoding
    name = body.parts[i].param["NAME"] || date.to_s
    i+=1
    attachment = imap.fetch(uid, "BODY[#{i}]")[0].attr["BODY[#{i}]"]
    p "#{name}, #{type}, #{encoding}"
    if opts[:attach].include? type and not attachment.nil?
      File.open(File.join(opts[:savedir], name),'wb+') do |f|
        if encoding == "BASE64"
          f.write(attachment.unpack('m')[0])
        else
          f.write(attachment)
        end          
      end
    end
  end  
end

Posting html to a confluence wiki

require 'xmlrpc/client'

user = "username"
pass = "password"
area = "area"
page_name="page"
content = "<h1>Big Header</h1>"
confluence = XMLRPC::Client
      .new2("https://#{user}:#{pass}@confluence.server.com/rpc/xmlrpc")
      .proxy("confluence1")
page = confluence.getPage("", area, page_name)
page["content"] = "{html}#{content}{html}"
confluence.storePage("", page)

Creating a highcharts JS chart

require 'erb'
require 'date'

def line arg={}

  arg[:height] = arg[:height] || ""
  arg[:width] = arg[:width] || ""

  line_chart = %{
    <div id="<%= arg[:name] %>" style="height:<%= arg[:height] %>px;width:<%= arg[:width] %>px;"></div>
    <script type="text/javascript">
     var month = new Array("Jan","Feb","Mar","Apr","May","Jun",
                           "Jul","Aug","Sept","Oct","Nov","Dec");
     var chart;
     $(document).ready(function() {
     chart = new Highcharts.Chart({
        chart: {
           renderTo: '<%= arg[:name] %>',
           defaultSeriesType: 'line',
           marginRight: 40,
           marginBottom: 40
        },
        credits:{
          enabled:false
        },
        plotOptions: {
           line: {
              dataLabels: {
                 enabled: <%= arg[:data_labels] || false %>
              }
           }
        },
        title: {
           text: '<%= arg[:name] %>',
           x: -20 //center
        },
        subtitle: {
           text: '<%= arg[:subtitle] %>',
           x: -20
        },
        xAxis: {
           type: "datetime",
           title: {
              text: '<%= arg[:xlabel] %>'
           },
        },
        yAxis: {
           min: <%= arg[:ymin] || 0 %>,
           title: {
              text: '<%= arg[:ylabel] %>'
           },
        },
        tooltip: {
           formatter: function() {
             return (new Date(this.x)).getDate() + ' ' +   
                    month[(new Date(this.x)).getMonth()] + 
                     ': '+ this.y;
           }
        },
        legend: {
           layout: 'vertical',
           align: 'right',
           verticalAlign: 'top',
           x: 0,
           y: 0,
           borderWidth: 2
        },
        series: [{
           pointInterval: 24 * 3600 * 1000,
           pointStart: <%= arg[:start_time] %>,
           name: '<%= arg[:name] %>',
           data: <%= arg[:values] %>
        }]
       });
      });
    </script>
    }
  ERB.new(line_chart).result(binding)
end

c = line(:name => "My Fancy Chart",
     :subtitle => "subtitle",
     :xlabel => "x label",
     :ylabel => "y label",
     :start_time => (Date.today-7).to_time.to_i * 1000, 
     :values => [12.2, 13.3, 11.1, 15.5])
print c

Who Chats the Most?

2011-04-28T00:00:00+00:00

From my Skype chat history, a visualization of the counts of chats by (anonymised) user.

Code

require 'sqlite3'
require 'rubyvis'

contacts={}
  
# count
db = SQLite3::Database.new("[skype-folder]/main.db")
db.execute("SELECT author, count(author) FROM Messages GROUP BY author ORDER BY count(author) DESC" ) do |author, count|
  #contacts[author]=count # real ones
  contacts[author.split('').sample(3).join]=count if count>60 # Anonymized
end

cs=pv.Colors.category20()
format=Rubyvis::Format.number
color = pv.Colors.category20
nodes = pv.dom(contacts).root("rubyvis").nodes

vis = pv.Panel.new()
    .width(600)
    .height(1000)

treemap = vis.add(Rubyvis::Layout::Treemap).
  nodes(nodes).
  mode("squarify").
  round(true)

treemap.leaf.add(Rubyvis::Panel).
  fill_style(lambda{ |d| cs.scale(d) }).
  stroke_style("#fff").
  line_width(1).
  antialias(true).
  title(lambda {|d| d.node_name + " " + format.format(d.node_value)})

treemap.node_label.add(Rubyvis::Label).
  text_style(lambda {|d| pv.rgb(0, 0, 0, 1)}).
  font(lambda{|d| v=d.node_value/90; (v<=8)? "8px sans-serif" : "#{v}px sans-serif"})
vis.render

# saves an svg (the block form closes the file once it is written)
File.open("contacts.svg", "w") { |f| f.write(vis.to_svg) }

Machine Learning Ex5.2 - Regularized Logistic Regression

2011-03-20T00:00:00+00:00

Exercise 5.2 improves the Logistic Regression implementation done in Exercise 4 by adding a regularization parameter that reduces the problem of over-fitting. We will be using Newton’s Method.

With implementation in R.

Data

Here’s what the data we want to fit looks like:

google.spreadsheet <- function (key) {
  library(RCurl)
  # ssl validation off: the option has to be passed to curl through .opts
  tt <- getForm("https://spreadsheets.google.com/spreadsheet/pub", 
                hl ="en_GB",
                key = key, 
                single = "true", gid ="0", 
                output = "csv", 
                .opts = list(followlocation = TRUE, verbose = TRUE,
                             ssl.verifypeer = FALSE)) 

  read.csv(textConnection(tt), header = TRUE)
}

# load the data
mydata = google.spreadsheet("0AnypY27pPCJydHZPN2pFbkZGd1RKeU81OFY3ZHJldWc")

# plot the data
plot(mydata$u[mydata$y == 0], mydata$v[mydata$y == 0], xlab="u", ylab="v")
points(mydata$u[mydata$y == 1], mydata$v[mydata$y == 1], col="blue", pch=3)
legend("topright", c("y=0","y=1"), pch=c(1, 3), col=c("black", "blue"), bty="n")

The idea of “fitting” is to create a mathematical model that separates the circles from the crosses in the plot above by learning from the existing data. The model then lets us predict, for a new u and v value, the probability of it being a cross.

Theory

The hypothesis is:

$h_\theta(x) = g(\theta^T x) = \frac{1}{ 1 + e ^{- \theta^T x} }$

Regularization is all about loosening up the tight fit, avoiding over-fitting and thus obtaining a more generalized fit that is more likely to work well on new data (for doing predictions).

For that we define the cost function with an added regularization term that blunts the fit, like so:

$J(\theta) = \frac{1}{m} \sum_{i=1}^m [-y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1- h_\theta(x^{(i)}))] + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$

Lambda is called the regularization parameter. Note that the sum starts at j=1, so $\theta_0$ is not penalized.

The iterative theta update using Newton’s method is defined as:

$\theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_{\theta}J$

And the gradient and Hessian are defined like so (in vectorized versions):

$\nabla_{\theta}J = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)} + \frac{\lambda}{m} \begin{bmatrix} 0 \\ \theta_1 \\ ... \\ \theta_n \end{bmatrix}$

$H = \frac{1}{m} \sum_{i=1}^m [h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) x^{(i)} (x^{(i)})^T] + \frac{\lambda}{m} \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & ... & \\ & & & 1 \end{bmatrix}$

Implementation

Let’s first define the functions above, with the added regularization parameter:

# sigmoid function
g = function (z) {
  return (1 / (1 + exp(-z)))
} # plot(g(c(1,2,3,4,5,6)), type="l")

# build high-order feature vector
# for 2 features, for a given degree
hi.features = function (f1,f2,deg) {
  ma = matrix(rep(1,length(f1)))
  for (i in 1:deg) {
    for (j in 0:i)    
      ma = cbind(ma, f1^(i-j) * f2^j)
  }
  return(ma)
} # hi.features(c(1,2), c(3,4),2)
# creates: 1 u v u^2 uv v^2 ...

# hypothesis
h = function (x,th) {
  return(g(x %*% th))
} # h(x,th)

# derivative of J 
grad = function (x,y,th,m,la) {
  G = (la/m * th)
  G[1,] = 0
  return((1/m * t(x) %*% (h(x,th) - y)) +  G)
} # grad(x,y,th,m,la)

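# hessian
H = function (x,y,th,m,la) {
  n = length(th)
  L = la/m * diag(n)
  L[1,] = 0
  # weight every sample (row of x) by h(1 - h); diag() on an m x 1 matrix
  # would only pick up its first element
  w = as.vector(h(x,th) * (1 - h(x,th)))
  return((1/m * t(x) %*% (x * w)) + L)
} # H(x,y,th,m,la)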

# cost function
J = function (x,y,th,m,la) {
  pt = th
  pt[1] = 0
  A = (la/(2*m))* t(pt) %*% pt
  return((1/m * sum(-y * log(h(x,th)) - (1 - y) * log(1 - h(x,th)))) + A)
} # J(x,y,th,m,la)
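A quick way to sanity-check the Hessian formula is to compare a vectorized version against the explicit per-sample sum. The sketch below does that on a few synthetic points; the names sigmoid, H.vec and H.sum and all the values are made up for illustration and are not part of the exercise.

```r
# Sketch: compare a vectorized Hessian with the explicit per-sample sum.
# All data below is synthetic and only for illustration.
set.seed(1)
m <- 6
x <- cbind(1, matrix(rnorm(m * 2), ncol = 2))   # m x 3 design matrix
th <- matrix(c(0.5, -1, 2), ncol = 1)
la <- 1

sigmoid <- function(z) 1 / (1 + exp(-z))
p <- as.vector(sigmoid(x %*% th))               # h_theta for every sample

# penalty matrix: identity with a 0 in the intercept position
pen <- diag(ncol(x)); pen[1, 1] <- 0

# vectorized: scale row i of x by p_i * (1 - p_i), then multiply by t(x)
H.vec <- (t(x) %*% (x * (p * (1 - p)))) / m + (la / m) * pen

# explicit sum over the samples
H.sum <- matrix(0, ncol(x), ncol(x))
for (i in 1:m) {
  xi <- matrix(x[i, ], ncol = 1)
  H.sum <- H.sum + p[i] * (1 - p[i]) * (xi %*% t(xi))
}
H.sum <- H.sum / m + (la / m) * pen

print(all.equal(H.vec, H.sum))
```

If the two disagree for your own implementation, the per-sample weights are the first place I would look.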

Now we can make it iterate until convergence, first for lambda=1

# setup variables
m = length(mydata$u) # samples
x = hi.features(mydata$u, mydata$v,6)
n = ncol(x) # features
y = matrix(mydata$y, ncol=1)

# lambda = 1
# use the cost function to check it works
th1 = matrix(0,n)
la = 1
jiter = array(0,c(15,1))
for (i in 1:15) {
  jiter[i] = J(x,y,th1,m,la)
  th1 = th1 - solve(H(x,y,th1,m,la)) %*% grad(x,y,th1,m,la) 
}

Validate that it is converging properly by plotting the cost J against the number of iterations.

# check that it is converging correctly
plot(jiter, xlab="iterations", ylab="cost J")

Converging well and fast, as is typical of Newton’s method.

And now we make it iterate for lambda=0 and lambda=10 for comparing fits later:

# lambda = 0
th0 = matrix(0,n)
la = 0
for (i in 1:15) {
  th0 = th0 - solve(H(x,y,th0,m,la)) %*% grad(x,y,th0,m,la) 
}

# lambda = 10
th10 = matrix(0,n)
la = 10
for (i in 1:15) {
  th10 = th10 - solve(H(x,y,th10,m,la)) %*% grad(x,y,th10,m,la) 
}

Finally calculate the decision boundary line and visualize it:

# calculate the decision boundary line
# by creating many points
u = seq(-1, 1.2, len=200);
v = seq(-1, 1.2, len=200);
z0 = matrix(0, length(u), length(v))
z1 = matrix(0, length(u), length(v))
z10 = matrix(0, length(u), length(v))
for (i in 1:length(u)) {
  for (j in 1:length(v)) {
    # contour() expects z[i,j] to hold the value at (u[i], v[j])
    z0[i,j] =  hi.features(u[i],v[j],6) %*% th0
    z1[i,j] =  hi.features(u[i],v[j],6) %*% th1
    z10[i,j] =  hi.features(u[i],v[j],6) %*% th10
  }
}

# plots: levels=0 draws only the decision boundary, where theta^T x = 0
contour(u,v,z0,levels=0, drawlabels=FALSE, xlab="u", ylab="v", col="black",lty=2)
contour(u,v,z1,levels=0, drawlabels=FALSE, col="red",lty=2, add=TRUE)
contour(u,v,z10,levels=0, drawlabels=FALSE, col="green3",lty=2, add=TRUE)
points(mydata$u[mydata$y == 0], mydata$v[mydata$y == 0])
points(mydata$u[mydata$y == 1], mydata$v[mydata$y == 1], col="blue", pch=3)
legend("topright",  c(expression(lambda==0), expression(lambda==1),expression(lambda==10)), lty=2, col=c("black", "red","green3"),bty="n" )

See that the black line (lambda=0) fits the crosses most tightly, and as we increase lambda the boundary becomes looser (more generalized), which should make it a better predictor for new, unseen data.
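The loosening effect can also be seen in the parameters themselves: the larger lambda is, the smaller the penalized parameters end up. Here is a minimal sketch of that, assuming synthetic data and a small made-up feature set instead of the exercise data; fit.newton and sigmoid are hypothetical helper names.

```r
# Sketch: regularized Newton's method on synthetic data, to see how the
# size of the fitted parameters changes with lambda.
set.seed(42)
m <- 100
u <- rnorm(m); v <- rnorm(m)
x <- cbind(1, u, v, u * v)                       # small made-up feature set
y <- matrix(as.numeric(u + v + rnorm(m) > 0), ncol = 1)

sigmoid <- function(z) 1 / (1 + exp(-z))

fit.newton <- function(x, y, la, iters = 25) {
  m <- nrow(x); n <- ncol(x)
  pen <- diag(n); pen[1, 1] <- 0                 # do not penalize the intercept
  th <- matrix(0, n, 1)
  for (k in 1:iters) {
    p <- as.vector(sigmoid(x %*% th))
    g <- t(x) %*% (p - y) / m + (la / m) * (pen %*% th)
    H <- t(x) %*% (x * (p * (1 - p))) / m + (la / m) * pen
    th <- th - solve(H, g)                       # Newton update
  }
  th
}

lambdas <- c(1, 10, 100)
# norm of the penalized parameters (everything except the intercept)
norms <- sapply(lambdas, function(la) sqrt(sum(fit.newton(x, y, la)[-1]^2)))
print(data.frame(lambda = lambdas, norm = norms))
```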

References

Thanks to Andrew Ng and OpenClassRoom for the great lessons.

Machine Learning Ex5.1 - Regularized Linear Regression

2011-03-18T00:00:00+00:00

Exercise 5.1 improves the Linear Regression implementation done in Exercise 3 by adding a regularization parameter that reduces the problem of over-fitting.

Over-fitting occurs especially when fitting a high-order polynomial, which is what we will try to do here.

With implementation in R.

Data

Here are the points we will make a model from:

google.spreadsheet <- function (key) {
  library(RCurl)
  # ssl validation off: the option has to be passed to curl through .opts
  tt <- getForm("https://spreadsheets.google.com/spreadsheet/pub", 
                hl ="en_GB",
                key = key, 
                single = "true", gid ="0", 
                output = "csv", 
                .opts = list(followlocation = TRUE, verbose = TRUE,
                             ssl.verifypeer = FALSE)) 

  read.csv(textConnection(tt), header = TRUE)
}

# load the data
mydata = google.spreadsheet("0AnypY27pPCJydGhtbUlZekVUQTc0dm5QaXp1YWpSY3c")

# view data
plot(mydata)

Theory

We will fit a 5th order polynomial, so the hypothesis is:

$h_\theta(x) = \theta_0 x_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \theta_5 x^5$

With x_0 = 1

The idea of the regularization is to blunt the fit a bit, i.e. loosen up the tight fit.

For that we define the cost function like so:

$J(\theta) = \frac{1}{2m} [\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2]$

Lambda is called the regularization parameter.

The regularization term added at the end makes the cost grow with the size of the parameters (all except $\theta_0$, since the sum starts at j=1). This is reflected in the search for the (\theta) parameters, which now favors smaller values, and consequently loosens up the tight fit.

After some math that is not shown here, the normal equations with the regularization parameter added, become:

$\theta = (X^T X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & ... & \\ & & & 1 \end{bmatrix} )^{-1} (X^T y)$
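Without going through the derivation, one way to gain some confidence in this formula is to check numerically that the theta it produces makes the gradient of J vanish. The sketch below does that on synthetic points; X, D, gradJ and the data are assumptions made up for the check.

```r
# Sketch: check numerically that the regularized normal equations give a
# theta where the gradient of J is zero. Data is synthetic.
set.seed(7)
m <- 7
xs <- seq(-1, 1, length.out = m)
X <- cbind(1, xs, xs^2, xs^3, xs^4, xs^5)        # 5th order polynomial features
y <- matrix(rnorm(m), ncol = 1)
lambda <- 1

D <- diag(ncol(X)); D[1, 1] <- 0                 # theta_0 is not penalized
theta <- solve(t(X) %*% X + lambda * D, t(X) %*% y)

# gradient of J = 1/(2m) * [ sum of squared errors + lambda * sum(theta_j^2) ]
gradJ <- (t(X) %*% (X %*% theta - y) + lambda * (D %*% theta)) / m
print(max(abs(gradJ)))                           # should be numerically zero
```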

Implementation

We will try 3 different lambda values to see how it influences the fit. Starting with lambda=0 where we can see the fit without the regularization parameter.

# setup variables
m = length(mydata$x) # samples
x = matrix(c(rep(1,m), mydata$x, mydata$x^2, mydata$x^3, mydata$x^4, mydata$x^5), ncol=6)
n = ncol(x) # features
y = matrix(mydata$y, ncol=1)
lambda = c(0,1,10)
d = diag(1,n,n)
d[1,1] = 0
th = array(0,c(n,length(lambda)))

# apply normal equations for each of the lambda's
for (i in 1:length(lambda)) {
  th[,i] = solve(t(x) %*% x + (lambda[i] * d)) %*% (t(x) %*% y)
}

# plot
plot(mydata)

# lets create many points
nwx = seq(-1, 1, len=50);
x = matrix(c(rep(1,length(nwx)), nwx, nwx^2, nwx^3, nwx^4, nwx^5), ncol=6)
lines(nwx, x %*% th[,1], col="blue", lty=2)
lines(nwx, x %*% th[,2], col="red", lty=2)
lines(nwx, x %*% th[,3], col="green3", lty=2)
legend("topright", c(expression(lambda==0), expression(lambda==1),expression(lambda==10)), lty=2,col=c("blue", "red", "green3"), bty="n")

With lambda=0 the fit is very tight to the original points (the blue line), but as we increase lambda the model gets less tight (more generalized), thus avoiding over-fitting.

References:

Exercise 3, original Linear Regression implementation
Thanks to Andrew Ng and OpenClassRoom for the great lessons.

Machine Learning Ex4 - Logistic Regression

2011-03-16T00:00:00+00:00

Exercise 4 is all about implementing Logistic Regression using Newton’s Method, on a classification problem.

For all this to make sense I suggest having a look at Andrew Ng’s machine learning lectures on OpenClassroom.

We start with a dataset representing 40 students who were admitted to college and 40 students who were not admitted, and their corresponding grades for 2 exams. Your mission, should you decide to accept it, is to build a binary classification model that estimates college admission chances based on a student’s scores on two exams (test1 and test2).

With implementation in R.

Plot the Data

We start by looking at the data.

google.spreadsheet <- function (key) {
  library(RCurl)
  # ssl validation off: the option has to be passed to curl through .opts
  tt <- getForm("https://spreadsheets.google.com/spreadsheet/pub", 
                hl ="en_GB",
                key = key, 
                single = "true", gid ="0", 
                output = "csv", 
                .opts = list(followlocation = TRUE, verbose = TRUE,
                             ssl.verifypeer = FALSE)) 

  read.csv(textConnection(tt), header = TRUE)
}

# load the data
mydata = google.spreadsheet("0AnypY27pPCJydC1vRVEzM1VJQnNneFo5dWNzR1F5Umc")

# plots
plot(mydata$test1[mydata$admitted == 0], mydata$test2[mydata$admitted == 0], xlab="test1", ylab="test2", col="red")
points(mydata$test1[mydata$admitted == 1], mydata$test2[mydata$admitted == 1], col="blue", pch=2)
legend("bottomright", c("not admitted", "admitted"), pch=c(1, 2), col=c("red", "blue") )

A Bit of Theory

Most of the ideas explored in linear regression apply in the same way. First we define what the hypothesis equation looks like (the mathematical representation of this knowledge). It originates from the line equation, but has now evolved into a new equation that returns values in [0,1], suited for binary classification. That is, we made up an equation that, given a test1 value and a test2 value, returns the probability that the student will be admitted (y=1) into college:

$h_\theta(x) = g(\theta^T x) = \frac{1}{ 1 + e ^{- \theta^T x} }$

g is the sigmoid function. And this returns:

$h_\theta(x) = P (y=1 | x; \theta)$

Now we need to find the (\theta) parameters for getting a working hypothesis equation. To help with that search we define a cost equation, that for a given (\theta) returns how far off we are compared to the sample data.

$J(\theta) = \frac{1}{m} \sum_{i=1}^m (-y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1- h_\theta(x^{(i)})) )$

The lower the cost the better (the closer to the real data we get). Thus, the goal becomes to minimize the cost.

We can use Newton’s method for that. Newton’s method, similarly to gradient descent, is a way to search for a zero of the derivative of the cost function, which is where the cost reaches its minimum. And after doing some math, the iterative (\theta) update using Newton’s method is defined as:

$\theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_{\theta}J$

And the gradient and Hessian are defined like so (in vectorized versions):

$\nabla_{\theta}J = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}$

$H = \frac{1}{m} \sum_{i=1}^m [h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) x^{(i)} (x^{(i)})^T]$
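To get a feeling for the update rule before applying it to theta, here is a minimal one-dimensional sketch. The function f(t) = exp(t) - 2t is just an example I picked because its minimum is known to be at t = log(2); f1, f2 and t.cur are hypothetical names.

```r
# Sketch: Newton's method in one dimension, minimizing f(t) = exp(t) - 2*t.
# The minimum is where f'(t) = exp(t) - 2 = 0, i.e. at t = log(2).
f1 <- function(t) exp(t) - 2      # first derivative
f2 <- function(t) exp(t)          # second derivative (the 1-D "Hessian")

t.cur <- 0
for (k in 1:8) {
  t.cur <- t.cur - f1(t.cur) / f2(t.cur)   # same shape as theta - H^-1 * grad
  cat(sprintf("iteration %d: t = %.10f\n", k, t.cur))
}
cat(sprintf("log(2)       = %.10f\n", log(2)))
```

The number of correct digits roughly doubles on every step, which is why so few iterations are needed.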

Implementation

First we implement the above equations:

# sigmoid
g = function (z) {
  return (1 / (1 + exp(-z) ))
} # plot(g(c(1,2,3,4,5,6)))

# hypothesis 
h = function (x,th) {
  return( g(x %*% th) )
} # h(x,th)

# cost
J = function (x,y,th,m) {
  return( 1/m * sum(-y * log(h(x,th)) - (1 - y) * log(1 - h(x,th))) )
} # J(x,y,th,m)

# derivative of J (gradient)
grad = function (x,y,th,m) {
  return( 1/m * t(x) %*% (h(x,th) - y))
} # grad(x,y,th,m)

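# Hessian
H = function (x,y,th,m) {
  # weight every sample (row of x) by h(1 - h); diag() on an m x 1 matrix
  # would only pick up its first element
  w = as.vector(h(x,th) * (1 - h(x,th)))
  return (1/m * t(x) %*% (x * w))
} # H(x,y,th,m)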

Make it go (iterate until convergence):

# setup variables
j = array(0,c(10,1))
m = length(mydata$test1)
x = matrix(c(rep(1,m), mydata$test1, mydata$test2), ncol=3)
y = matrix(mydata$admitted, ncol=1)
th = matrix(0,3)

# iterate 
# note that the newton's method converges fast, 10x is enough
for (i in 1:10) {
  j[i] = J(x,y,th,m) # stores each iteration Cost
  th = th - solve(H(x,y,th,m)) %*% grad(x,y,th,m) 
}

Have a look at the cost function by iteration:

plot(j, xlab="iterations", ylab="cost J")

See that only 4-5 iterations are needed: it converges much faster than gradient descent.

Exercise questions:

# 1. What values of theta did you get? How many iterations were required for convergence?
print("1.")
print(th)

# 2. What is the probability that a student with a score of 20 on Exam 1
# and a score of 80 on Exam 2 will not be admitted?
print("2.")
print((1 - g(c(1, 20, 80) %*% th))* 100)

         
[1] "1."
            [,1]
[1,] -16.4469479
[2,]   0.1457278
[3,]   0.1618285
[1] "2."
         [,1]
[1,] 64.24722

To visualize the fit, note that the decision boundary is where $P(y=1 | x ;\theta) = 0.5$, which happens when $\theta^T x=0$. Thus:

# theta0*x0 + theta1*x1 + theta2*x2 = 0 (with x0 = 1) is the decision boundary line

# get 2 values of test1 (that will define a line)
x1 = c(min(x[,2]), max(x[,2]))

# solving for x2, we get:
x2 = (-1/th[3,]) * ((th[2,] * x1) + th[1,])

# plot
plot(x1,x2, type='l',  xlab="test1", ylab="test2")
points(mydata$test1[mydata$admitted == 0], mydata$test2[mydata$admitted == 0], col="red")
points(mydata$test1[mydata$admitted == 1], mydata$test2[mydata$admitted == 1], col="blue", pch=2)
legend("bottomright", c("not admitted", "admitted"), pch=c(1, 2), col=c("red", "blue") )

Beautiful
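As a small check of the boundary formula, any point on that line should get a predicted probability of exactly 0.5. A minimal sketch, using the theta values printed above and a few arbitrary test1 scores:

```r
# Sketch: points on the decision boundary get a predicted probability of 0.5.
# th holds the theta values printed above.
th <- c(-16.4469479, 0.1457278, 0.1618285)
g <- function(z) 1 / (1 + exp(-z))

x1 <- c(20, 40, 60)                          # arbitrary test1 scores
x2 <- (-1 / th[3]) * (th[2] * x1 + th[1])    # matching test2 scores on the line

p <- g(th[1] + th[2] * x1 + th[3] * x2)
print(cbind(test1 = x1, test2 = x2, p = p))
```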

Notes & References:

Thanks Tal Galili for adding my blog into the R-bloggers.com list. Go have a peek, R-bloggers is a great source of R information.
Exercise 4 here
Lectures here
Thanks to Andrew Ng and OpenClassRoom for the great lessons.

Machine Learning Ex3 - Multivariate Linear Regression

2011-03-08T00:00:00+00:00

Exercise 3 is about multivariate linear regression. Start by finding a good learning rate (alpha) and then implement linear regression using the normal equations instead of the gradient descent algorithm.

With implementation in R.

Data

As usual hosted in google docs:

google.spreadsheet <- function (key) {
  library(RCurl)
  # ssl validation off: the option has to be passed to curl through .opts
  tt <- getForm("https://spreadsheets.google.com/spreadsheet/pub", 
                hl ="en_GB",
                key = key, 
                single = "true", gid ="0", 
                output = "csv", 
                .opts = list(followlocation = TRUE, verbose = TRUE,
                             ssl.verifypeer = FALSE)) 

  read.csv(textConnection(tt), header = TRUE)
}

# load the data
mydata = google.spreadsheet("0AnypY27pPCJydExfUzdtVXZuUWphM19vdVBidnFFSWc")

Feature Scaling

When applying gradient descent, we need to make sure that the feature values are of the same order of magnitude, otherwise it will not converge well, so here’s a helper function to scale features:

# given a data frame and the column names i want to scale
# creates new columns: feature.scale = (feature - mean)/std
feature.scale = function (dta, cols) {
  for (col in cols) {
    # [[col]] extracts the column as a plain vector, which sd() and mean() need
    sigma = sd(dta[[col]])
    mu = mean(dta[[col]])
    dta[[paste(col, ".scale", sep = "")]] = (dta[[col]] - mu)/sigma
  }
  return(dta)
}

dta = feature.scale(mydata, c("area", "bedrooms"))
tail(dta, 5)



   area bedrooms  price area.scale bedrooms.scale
43 2567        4 314000  0.7126179      1.0904165
44 1200        3 299000 -1.0075229     -0.2236752
45  852        2 179900 -1.4454227     -1.5377669
46 1852        4 299900 -0.1870900      1.0904165
47 1203        3 239500 -1.0037479     -0.2236752
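For comparison, base R’s scale() applies the same (feature - mean)/std standardization. The sketch below uses only the five houses shown above, so the scaled values will differ from the table (there the mean and std come from the full dataset); houses, scaled and by.hand are names made up for the example.

```r
# Sketch: base R's scale() does the same (x - mean) / sd standardization.
# Only the five houses from the output above are used here.
houses <- data.frame(area = c(2567, 1200, 852, 1852, 1203),
                     bedrooms = c(4, 3, 2, 4, 3))

scaled <- scale(houses)                      # centers and scales each column
by.hand <- (houses$area - mean(houses$area)) / sd(houses$area)

print(round(colMeans(scaled), 10))           # both columns now have mean 0
print(apply(scaled, 2, sd))                  # and standard deviation 1
print(all.equal(as.vector(scaled[, "area"]), by.hand))
```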

Finding Alpha

Recall from ex2 that the gradient descent equation for the updates of theta is:

$\theta := \theta - \alpha \frac{1}{m} x^T (x\theta^T - y)$

For finding a good alpha ($\alpha$) we will use a trial and error approach. The idea is to look at how the cost value $J(\theta)$ drops with the number of iterations: the faster the drop the better, but if it goes up then the alpha value is already too large.

The Cost is given by(in vectorized form):

$J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)$

See the lessons for details on how to derive that equation.

Implementing:

# lets try out a few alpha's
alpha = c(0.03, 0.1, 0.3, 1, 1.3, 2)

# store the J values over the iterations
J = array(0,c(50,length(alpha)))
m = length(dta$price)
theta = matrix(c(0,0,0), nrow=1)
x = matrix(c(rep(1,m), dta$area.scale, dta$bedrooms.scale), ncol=3)
y = matrix(dta$price, ncol=1)

# the delta updates
delta = function(x,y,th) {
  delta = (t(x) %*% ((x %*% t(th)) - y))
  return(t(delta))
}

# the cost for a given theta
cost = function(x,y,th,m) {
  prt = ((x %*% t(th)) - y)
  return(1/(2*m) * (t(prt) %*% prt))
}

# run J for 50x, on each alpha
for (j in 1:length(alpha)) {
  theta = matrix(c(0,0,0), nrow=1) # restart from zero for every alpha
  for (i in 1:50) {
    J[i,j] = cost(x,y,theta,m) # capture the Cost
    theta = theta - alpha[j] * 1/m * delta(x,y,theta)
  }
}

# lets have a look
par(mfrow=c(3,2))
for (j in 1:length(alpha)) {
  plot(J[,j], type="l", xlab=paste("alpha", alpha[j]), ylab=expression(J(theta)))
}

alpha 1 seems to be the best.
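Trial and error works fine here, but for this quadratic cost there is also a known limit: gradient descent only converges when alpha is below 2 divided by the largest eigenvalue of X^T X / m. The sketch below illustrates that on synthetic data (not the housing data); run.gd, cost and limit are hypothetical names.

```r
# Sketch: for this quadratic cost, gradient descent converges only when
# alpha < 2 / (largest eigenvalue of X^T X / m). Data is synthetic.
set.seed(3)
m <- 50
x <- cbind(1, scale(rnorm(m)), scale(rnorm(m)))  # intercept + 2 scaled features
y <- matrix(3 + 2 * x[, 2] - x[, 3] + rnorm(m), ncol = 1)

cost <- function(th) sum((x %*% th - y)^2) / (2 * m)
run.gd <- function(alpha, iters = 100) {
  th <- matrix(0, 3, 1)
  for (i in 1:iters) th <- th - alpha / m * t(x) %*% (x %*% th - y)
  cost(th)
}

limit <- 2 / max(eigen(crossprod(x) / m)$values)
cat("largest usable alpha is about", limit, "\n")
cat("cost after 100 steps, alpha just below:", run.gd(0.9 * limit), "\n")
cat("cost after 100 steps, alpha just above:", run.gd(1.1 * limit), "\n")
```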

Setting $\alpha=1$ and running until convergence:

# running until convergence, starting again from zero
theta = matrix(c(0,0,0), nrow=1)
for (i in 1:50000) {
  theta = theta - 1 * 1/m * delta(x,y,theta)
  if (abs(delta(x,y,theta)[2]) < 0.0000001) {  
    break # to interrupt updates
  }
}

# 1. The final values of theta
print("Theta:")
print(theta)

# 2. The predicted price of a house with 1650 square feet and 3 bedrooms.
# Don't forget to scale your features when you make this prediction!
print("Prediction for a house with 1650 square feet and 3 bedrooms:")
print(theta %*% c(1, (1650 - mean(dta$area))/sd(dta$area), (3 - mean(dta$bedrooms))/sd(dta$bedrooms)))



[1] "Theta:"
         [,1]     [,2]      [,3]
[1,] 340412.7 110631.1 -6649.474
[1] "Prediction for a house with 1650 square feet and 3 bedrooms:"
         [,1]
[1,] 293081.5

Normal Equations

Given the cost function:

$J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$

Recall that this function returns how big the error of our model is versus the data, so our goal is to minimize it. To find its minimum there is also a more direct approach than gradient descent: calculate its derivative, set it to 0 and solve for theta:

$\frac{\partial}{\partial \theta_j} J(\theta) = 0$

That’s for $\theta_j$. We of course need to account for every j.

If we write it down into matrix notation, calculate its derivatives and set it to 0, then the value of theta will be obtained with:

$\theta = (X^T X)^{-1} (X^T y)$

That can be easily implemented like so:

x = matrix(c(rep(1,m), mydata$area, mydata$bedrooms), ncol=3)
y = matrix(mydata$price, ncol=1)
theta.normal = solve(t(x) %*% x) %*% (t(x) %*% y)

# 1. In your program, use the formula above to calculate theta. Remember that
# while you don't need to scale your features, you still need to add 
# an intercept term.
print("Theta:")
print(theta.normal)

# 2. Once you have found theta from this method, use it to make a price prediction 
# for a 1650-square-foot house with 3 bedrooms. Did you get the same price 
# that you found through gradient descent?
print("Price prediction for a 1650-square-foot house with 3 bedrooms")
t(theta.normal) %*%  c(1, 1650, 3)



[1] "Theta:"
           [,1]
[1,] 89597.9095
[2,]   139.2107
[3,] -8738.0191
[1] "Price prediction for a 1650-square-foot house with 3 bedrooms"
         [,1]
[1,] 293081.5

Normal equations are more direct, but they require solving an n x n linear system (n being the number of features), which gets costly when there are many features, so depending on the situation you might need to choose one or the other.
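A handy cross-check for the normal equations is R’s own lm(), which should return the same coefficients. A minimal sketch, assuming made-up houses rather than the exercise data:

```r
# Sketch: the normal equations give the same coefficients as R's lm().
# The houses below are made up for illustration.
set.seed(11)
m <- 30
area <- round(runif(m, 800, 3000))
bedrooms <- sample(2:5, m, replace = TRUE)
price <- 90000 + 140 * area - 9000 * bedrooms + rnorm(m, sd = 20000)

X <- cbind(1, area, bedrooms)
# solve(A, b) solves the linear system directly instead of forming the inverse
theta <- solve(t(X) %*% X, t(X) %*% price)

fit <- lm(price ~ area + bedrooms)
print(cbind(normal.eq = as.vector(theta), lm = unname(coef(fit))))
```

Passing both sides to solve() avoids computing the inverse explicitly, which is usually a bit cheaper and numerically safer.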

References

OpenClassroom Machine Learning
Exercise 3: Multivariate Linear Regression
Thanks to Andrew Ng and OpenClassRoom for the great lessons.

Machine Learning Ex2 - Linear Regression

2011-02-24T00:00:00+00:00

Andrew Ng has posted introductory machine learning lessons on the OpenClassRoom site. I’ve watched the first set and will here solve Exercise 2.

The exercise is to build a linear regression implementation; I’ll use R.

The point of linear regression is to come up with a mathematical function (model) that represents the data as well as possible, which is done by fitting a straight line to the observed data. This model will then allow us to make predictions on new data.

For example, the data we use here are boys’ ages and their corresponding heights, so when we get the mathematical model we will be able to guess a boy’s height from his age.

Data

google.spreadsheet <- function (key) {
  library(RCurl)
  # ssl validation off: the option has to be passed to curl through .opts
  tt <- getForm("https://spreadsheets.google.com/spreadsheet/pub", 
                hl ="en_GB",
                key = key, 
                single = "true", gid ="0", 
                output = "csv", 
                .opts = list(followlocation = TRUE, verbose = TRUE,
                             ssl.verifypeer = FALSE)) 

  read.csv(textConnection(tt), header = TRUE)
}

# load the data
mydata = google.spreadsheet("0AnypY27pPCJydDB4N3MxM0tENlk3UElnZ013cW1iM3c")

# include ggplot2
library(ggplot2)

ex2plot = ggplot(mydata, aes(x, y)) + geom_point() + 
       ylab('Height in meters') +
       xlab('Age in years')

Theory

The model we will get at the end is a line that fits the data. Setting $x_0 = 1$, it is defined like so:

$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + ...$

That can be summarized by (last is matrix notation):

$h_\theta(x) = \sum_{i=0}^n \theta_i x_i = \theta^T x$

Matrix representation is useful because it has good support in software tools.

The goal is to get the line as close as possible to the observed data points, thus we can define a cost function that returns the difference between the real data and the model:

$J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$

where i is each data example we have and m is their total.

With J we now have a metric to check whether the hypothesis line is getting closer to the data points or not.

The next step is to find the smallest possible cost J, and in fact that’s exactly what the gradient descent algorithm does: starting with an initial guess, it iterates to smaller and smaller values of a given function by stepping against the direction of the derivative:

$x_i := x_{i-1} - \epsilon f'(x_{i-1})$

Applying to our J:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

And doing a bit of calculus on derivatives we get:

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$

Where alpha defines the size of steps of the convergence to $\theta$.
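If you don’t feel like redoing the calculus, the derivative can also be checked numerically by comparing it with a finite-difference estimate of J. A minimal sketch on synthetic data; grad.analytic, grad.numeric and the chosen theta are assumptions for the example.

```r
# Sketch: compare the analytic gradient of J with a finite-difference estimate.
# Data and theta are synthetic.
set.seed(5)
m <- 20
x <- cbind(1, runif(m, 2, 8))                # x_0 = 1 plus one feature
y <- matrix(0.75 + 0.06 * x[, 2] + rnorm(m, sd = 0.05), ncol = 1)
theta <- matrix(c(0.3, -0.2), ncol = 1)

J <- function(th) sum((x %*% th - y)^2) / (2 * m)
grad.analytic <- t(x) %*% (x %*% theta - y) / m

eps <- 1e-6
grad.numeric <- sapply(1:2, function(j) {
  step <- matrix(0, 2, 1); step[j] <- eps
  (J(theta + step) - J(theta - step)) / (2 * eps)   # central difference
})

print(cbind(analytic = as.vector(grad.analytic), numeric = grad.numeric))
```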

Now let’s check if all this math really works.

Implementation - take 1

alpha = 0.07
m = length(mydata$x)
theta = c(0,0)
x = mydata$x
y = mydata$y 
delta = function(x,y,th,m) {
  total = 0
  for (i in 1:m) {
    # as.numeric() turns the 1x1 matrix product into a plain number
    total = total + (as.numeric(t(th) %*% c(1,x[i])) - y[i]) * c(1,x[i])
  }
  return (total)
}

# 1 iteration
theta - alpha * 1/m * delta(x,y,theta,m)
       
[1] 0.07452802 0.38002167

Implementation - take 2

After having a peek at the Matlab solution, I learned that it is possible to replace the sum in the equation with a transposed matrix multiplication (like done with the line equation):

$\theta := \theta - \alpha \frac{1}{m} x^T (x\theta^T - y)$

So we can get a full matrix implementation:

alpha = 0.07
m = length(mydata$x)
theta = matrix(c(0,0), nrow=1)
x = matrix(c(rep(1,m), mydata$x), ncol=2)
y = matrix(mydata$y, ncol=1)
delta = function(x,y,th) {
  delta = (t(x) %*% ((x %*% t(th)) - y))
  return(t(delta))
}

# 1 iteration
theta - alpha * 1/m * delta(x,y,theta)



           [,1]      [,2]
[1,] 0.07452802 0.3800217

The Model

First we run several iterations, until convergence:

for (i in 1:1500) {
  theta = theta - alpha * 1/m * delta(x,y,theta)
}
theta

          [,1]       [,2]
[1,] 0.7501504 0.06388338

And finally we see how well the line(model) fits the data:

ex2plot + geom_abline(intercept=theta[1], slope=theta[2])

References

MATLAB / R Reference, by David Hiebeler
Short Math Guide for LaTex(.pdf)
Matrix multiplier tool
Thanks to Andrew Ng and OpenClassRoom for the great lessons.

Weight Loss Predictor

2011-02-05T00:00:00+00:00

For Xmas 2010 I got a very cool book called the “4 Hour Body” (thanks Jose Santos), written by Tim Ferriss, who wrote a previous favorite of mine about productivity, the 4 Hour Work Week.

I like the book’s approach: it doesn’t just say do this, do that and you’ll be healthy. It actually says: I (Tim Ferriss) have tried this, exactly with these steps, during this time, this is how I measured, these are the results I got, and looking at the most up-to-date medical research this is the most likely explanation for these results… Notice the similar principles to AB testing, as in, try things out, measure the results, and see if they work for you.

Also, this book couldn’t have arrived at a better time, as I had just peaked at my heaviest weight in a long time, blame it on [insert favorite reason]… so, long story short, I am now on the 3rd week of the low-carb diet described in the book.

But of course, like with all diets, I’m quickly growing impatient about when I’m going to reach my goal of an adequate BMI, so let’s use R and Monte Carlo simulations to generate predictions.

Data

I have been tracking my weight using Google Spreadsheets, so I can get the data into R like so:

google.spreadsheet <- function (key) {
  library(RCurl)
  # ssl validation off: the option has to be passed to curl through .opts
  tt <- getForm("https://spreadsheets.google.com/spreadsheet/pub", 
                hl ="en_GB",
                key = key, 
                single = "true", gid ="0", 
                output = "csv", 
                .opts = list(followlocation = TRUE, verbose = TRUE,
                             ssl.verifypeer = FALSE)) 

  read.csv(textConnection(tt), header = TRUE)
}

# load the data
mydata = google.spreadsheet("0AnypY27pPCJydEstZnVOeHYycjVWWktjbHpvS1NRMUE")

# Create a new column with the proper date format
mydata$timestamp = as.Date(mydata$timestamp, format='%d/%m/%Y')

The last 5 measurements:

tail(mydata, 5)

     timestamp   kg
140 2011-03-04 77.3
141 2011-03-05 76.9
142 2011-03-06 77.2
143 2011-03-07 77.3
144 2011-03-08 76.9

Past Years weight

Let’s have a look at the weight fluctuations over the past 3.5 years (before the diet).

# include ggplot2
library(ggplot2)

beforediet = subset(mydata, timestamp < "2011-01-18")
ggplot(beforediet, aes(x=timestamp, y=kg)) + geom_point() + geom_smooth()

Weight has been mostly (on average) 80.5kg, but in the 2nd half of 2010 we see a big jump. Also note that the middle of the year (summer here) appears to be where the jumps in weight happen.

Predicting the Future

Now, using Monte Carlo methods, let’s simulate the future based on the weight changes that happened since the start of the diet.

First we need to do some trickery to fill in the missing days and calculate the changes for every single day. The idea is to interpolate linearly between the known days to fill in the missing values. E.g. with day1=81 and day3=80, we set day2=80.5.
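The day1/day3 example can be reproduced in a couple of lines. Note that this sketch uses base R’s approx() rather than zoo, only to show the interpolation idea; known.day, known.kg and filled are names made up for the example.

```r
# Sketch: linear interpolation of a missing day with base R's approx().
known.day <- c(1, 3)
known.kg  <- c(81, 80)

# ask for the value on every day from 1 to 3, including the missing day 2
filled <- approx(x = known.day, y = known.kg, xout = 1:3)$y
print(filled)   # 81.0 80.5 80.0
```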

So let’s get the weight (kg) change (delta) for every day:

library(zoo) # for missing values interpolation

fill.all.days = function (mydata, timecolname, valuecolname) {
  dtrange = range(mydata[,timecolname])

  # create a data frame with every single day
  alldays = data.frame(tmp=seq(as.Date(dtrange[1]), as.Date(dtrange[2]), "days"))
  colnames(alldays) = c(timecolname) # rename tmp to proper timecolname

  # add the existing values
  alldays = merge(alldays, mydata, by=timecolname, all=TRUE)

  # fill in the missing ones
  alldays[,valuecolname] = na.approx(alldays[,valuecolname])
  return(alldays)
}

# from start of diet
dietdata = subset(mydata, timestamp >= "2011-01-17")
lastweight = tail(dietdata$kg, n=1)

# fill in missing days
dietalldays = fill.all.days(dietdata, "timestamp", "kg")

# get difference day by day into data frame
kgdelta = diff(dietalldays$kg)
dietalldays$delta = c(0, kgdelta)

# print only the 5 last values
tail(dietalldays, 5)    

    timestamp   kg delta
47 2011-03-04 77.3   0.0
48 2011-03-05 76.9  -0.4
49 2011-03-06 77.2   0.3
50 2011-03-07 77.3   0.1
51 2011-03-08 76.9  -0.4

What is going to be my weight in a week?

predict.weight.in.days = function(days, inicialweight, deltavector) {
  weight = inicialweight
  for (i in 1:days) {
    weight = weight + sample(deltavector, 1, replace=TRUE)
  }
  return(weight)
}

# simulate it 10k times
mcWeightWeek = replicate(10000, predict.weight.in.days(7, lastweight, kgdelta))

summary(mcWeightWeek)   

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   71.8    75.2    75.9    75.9    76.6    79.7

Another good thing about Monte Carlo methods is that they give a distribution of the prediction, so it is possible to get a feeling for how reliable the average is by comparing the distribution with the expected normal distribution:

library(grid) # for viewport()

gghist = function(mydata, mycolname) {
  pl = ggplot(data = mydata)
  subvp = viewport(width=0.35, height=0.35, x=0.84, y=0.84)

  his = pl + 
        geom_histogram(aes_string(x=mycolname,y="..density.."),alpha=0.2) + 
        geom_density(aes_string(x=mycolname)) + 
        ggtitle(mycolname) # opts(title=...) was removed from ggplot2

  qqp = pl + 
        geom_point(aes_string(sample=mycolname), stat="qq") + labs(x=NULL, y=NULL) + 
        ggtitle("QQ")

  print(his)
  print(qqp, vp = subvp)
}

gghist(data.frame(kg=mcWeightWeek), "kg")
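Besides looking at the plot, the spread of the simulated weights can be summarized with quantiles. A minimal sketch, assuming made-up daily deltas and starting weight instead of my real measurements; deltas, start.kg, sim and interval are hypothetical names.

```r
# Sketch: summarize a Monte Carlo prediction with quantiles.
# The daily deltas and the starting weight are made-up values.
set.seed(2011)
deltas <- c(-0.4, 0.3, 0.1, -0.4, 0.0, -0.2, -0.3, 0.2, -0.1, -0.5)
start.kg <- 77

# 10k simulated weights after 7 days of randomly drawn daily changes
sim <- replicate(10000, start.kg + sum(sample(deltas, 7, replace = TRUE)))

# 80% of the simulated outcomes fall between these two weights
interval <- quantile(sim, c(0.10, 0.90))
print(interval)
print(median(sim))
```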

And when am I getting to 75kg?

days.to.weight = function(weight, inicialweight, deltavector) {
  target = inicialweight
  days = 0
  while (target > weight) {
    target = target + sample(deltavector, 1, replace=TRUE)
     days = days + 1
     if (days >= 1095) # if value too crazy just interrupt the loop
        break
  }
  return(days)
}

# simulate it 10k times
mcDays75 = replicate(10000, days.to.weight(75, lastweight, kgdelta))

summary(mcDays75)


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    8.00   12.00   15.36   19.25  102.00

And the cumulative distribution:

# add dates to it, from today's date + #days
days75 = sort(tail(mydata$timestamp, 1) + mcDays75)

# get the ecdf values into a dataframe (dates are converted to numbers for ecdf)
udays = unique(days75)
days75.ecdf = data.frame(days = udays, 
                         ecdf = ecdf(as.numeric(days75))(as.numeric(udays)))

# date where its 80% sure i'll reach goal
prob80 = head(days75.ecdf[days75.ecdf$ecdf>0.80,],1)

# plot
ggplot(days75.ecdf, aes(days, ecdf)) + geom_step() +
       ylab("probability") + 
       geom_point(aes(x = prob80$days, y = prob80$ecdf)) +
       geom_text(aes(x = prob80$days, y = prob80$ecdf, 
                    label = paste("80% sure to be 75kg on",
                            format(prob80$days, "%a, %d %b %Y"))), 
                     hjust=-0.04)

Also note that, weight loss is faster at the beginning of a diet, it tends to slow down over time, so to keep the predictions valid we need to continue record the weight and re-run the predictions frequently.

But as you can see, the slow carb diet seems to work, even without exercise. Tim’s book is great: it focuses on the smallest possible changes that give the biggest results (= efficiency).

Update

Got to 75.0kg on 25 March. That’s 67 days (approx. 9.5 weeks) for a 9kg loss, thus approx. 1kg per week, which is close to the recommended weight loss rate (0.9kg per week). I am now within normal BMI values.

References

Hard drive occupation prediction with R: part 1 and part 2, and thanks to Leandro Penz for the feedback.
Big thanks to Jose and Tim for jumpstarting this experiment!

On Tools. Featuring guitar pedals, cattle growing and math

2011-01-12T00:00:00+00:00

The Idea

Bear with me during all the technical and guitar lingo, and keep reading because there is a point.

After the last few weeks of obsessively investigating guitar pedals (because of “what should I get this year for my birthday?”), I decided to change some parts of my guitar setup and, long story short, I ended up with only 1 analog pedal in my setup, which, as everybody knows, is too little. So I came up with an idea: I could build my own guitar pedal that I can tweak to make it custom and unique sounding. Let’s go!

Do I need to Grow Cattle?

After some web browsing on DIY analog pedal building and schematics, I figured out that building an analog pedal is not really worth it, because without knowledge of analog electronics I would just end up building some schematic I found on the web without truly understanding it, which limits my power to improve it.

So maybe a better idea is to build a software guitar pedal instead, where I can do all the programming and thus be in a better position to tweak it to create my own sounds. Right? Wrong!

I had a look at the audio plugin programming world and it looked to me like: incompatible platforms, buffers, streaming, real-time, a whole new set of programming environments to learn, etc… a big mess… I don’t care about all that, I just want to play around with sound design ideas.

It’s like wanting to make my own pizza, but for the cheese I have to grow cattle, get the milk and make the cheese, when what I really want is just to play around with different cheeses and ingredients to create my own pizza. It’s a whole different set of skills and goals…

Finding Gold

Then at some point I landed on a page that described mathematical models of “amplifier tubes” and realized: this is a model simulating real sound behavior, so it is focused on exactly the right thing. Also, a theoretical math model stands the test of time: it can be implemented in any computer simulation or even in a standalone digital chip. The math modeling work is truly platform independent and without expiration date; this is real gold.

And look at the implementation side of it. Implementing is actually monkey / mechanical work, i.e. there is nothing new about implementing a math formula in a programming language. Besides, most of the effort will unfortunately go into the audio plugin programming environment (the cattle growing part)… On top of that, even if I implement it on the most modern platform available, it will most likely only run for a couple of years, because when a new audio plugin platform arrives it becomes incompatible and I will have to do it again… Also, I don’t want to go through the whole cattle growing process just to find at the end that the model does not actually work as I thought… There has to be a better way…

Making it Real

Then again, the math by itself will not make my new guitar pedal, so we still need something that will run that theoretical model on the computer… Finally I stumbled on guitar effects simulations made in Matlab, looked at the source code… and… Brilliant! Short and to the point, high-level algorithm implementations of audio effects. This is great: it will let me try out the math models I found before much more easily and quickly than with the full stack of audio plugin development.

With the caveat that Matlab is not really suited for building a production-standard final product; it is targeted instead at creating a working prototype. That is perfectly fine, because if the models really work out well then, and only then, I could create a company, get investors / money and hire someone who knows better and is more interested than me in the cattle growing part to create that production-standard final product and maintain it.

In the meantime, with math and a quick prototyping tool I can focus just on modeling and trying out guitar pedal ideas.
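
To give a feel for how short such a high-level prototype can be, here is a minimal sketch in Python rather than Matlab. It assumes a plain tanh soft-clipping curve as a stand-in for a tube model; the curve, the `drive` parameter and the function names are illustrative assumptions, not taken from the page or the Matlab code mentioned above.

```python
import math

def soft_clip(samples, drive=5.0):
    """Tanh waveshaping: louder input is squashed smoothly towards +/-1."""
    return [math.tanh(drive * s) for s in samples]

def sine_wave(freq_hz, seconds, sample_rate=44100):
    """Generate a test tone standing in for a guitar signal."""
    n = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

if __name__ == "__main__":
    clean = sine_wave(440.0, 0.01)
    driven = soft_clip(clean, drive=5.0)
    print("peak before:", max(clean), "peak after:", max(driven))
```

The whole "effect" is one line of math, which is the point: trying a different model means swapping that line, not rebuilding a plugin.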

The point: tools are the means to an end, not the end itself, and this is very easy to forget. But tools do matter: a bad tool can distract you from focusing on the right things and slow you down, while a good tool can make you work in a better way and get to the goal faster.

A sufficiently different and good tool can make you think differently.

Homemade Auto-Updater

2010-12-01T00:00:00+00:00

Here’s a script that I use frequently to update an application to the latest version. It automates the process of downloading and installing the app.

Aquamacs

Every day there’s a new release of Aquamacs, called the nightly build, made from the latest development code. It is always available from the same URL.

So the script does the following: creates a temporary folder, downloads the latest version, unpacks it, asks to close Aquamacs if it is running (so it can be replaced), creates a backup copy AquamacsOld.app (in case the new one has trouble, I can use the previous one), copies the files to the Applications folder and cleans up the temporary downloaded files.

#!/bin/bash
mkdir /tmp/emacsdownload && cd /tmp/emacsdownload
curl http://braeburn.aquamacs.org/~dr/Aquamacs/24/Aquamacs-nightly.tar.bz2 --silent -o /tmp/emacsdownload/aquamacs.tar.bz2
tar xjf /tmp/emacsdownload/aquamacs.tar.bz2
RUN=`ps -ef | grep Aquamacs | grep -v grep`
if [ -n "$RUN" ]; then
  x=`/usr/bin/osascript <<EOT
    tell application "Finder"
      activate
      set myReply to button returned of (display dialog "Please close Aquamacs to update" default button 2 buttons {"No", "Ok"})
    end tell
EOT`
  if [[ $x = "No" ]]; then exit; fi
fi
rm -rf /Applications/AquamacsOld.app
mv -f /Applications/Aquamacs.app /Applications/AquamacsOld.app
cp -R /tmp/emacsdownload/Aquamacs.app /Applications
rm -rf /tmp/emacsdownload

Chromium

For Chromium the download URL changes on each build, so we have to add extra logic: the script first figures out the latest version and then uses that information to build the download URL. The rest is similar to the Aquamacs script.

#!/bin/bash
CHROMEDIR="http://build.chromium.org/f/chromium/snapshots/Mac/"
mkdir /tmp/chromedownload && cd /tmp/chromedownload
curl $CHROMEDIR/LATEST -o /tmp/chromedownload/LATEST --silent && LATEST=`cat /tmp/chromedownload/LATEST`
curl $CHROMEDIR/$LATEST/chrome-mac.zip --silent -o /tmp/chromedownload/chrome-mac.zip
unzip -qq /tmp/chromedownload/chrome-mac.zip
RUN=`ps -ef | grep Chromium | grep -v grep`
if [ -n "$RUN" ]; then
  x=`/usr/bin/osascript <<EOT
    tell application "Finder"
      activate
      set myReply to button returned of (display dialog "Please close Chromium to update" default button 2 buttons {"No", "Ok"})
    end tell
EOT`
  if [[ $x = "No" ]]; then exit; fi
fi
rm -rf /Applications/Chromium.app
cp -R /tmp/chromedownload/chrome-mac/Chromium.app /Applications
rm -rf /tmp/chromedownload
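
The two-step lookup used for Chromium (read the latest build id, then build the download URL from it) can be sketched on its own. This is a small illustration in Python, not part of the script; the function name and the build id in the usage line are made up.

```python
def snapshot_url(base_url, latest_build, archive="chrome-mac.zip"):
    """Join base URL, build id and archive name with exactly one slash each."""
    # strip() removes the trailing newline a LATEST file usually ends with
    return "/".join([base_url.rstrip("/"), latest_build.strip(), archive])

if __name__ == "__main__":
    base = "http://build.chromium.org/f/chromium/snapshots/Mac/"
    print(snapshot_url(base, "12345\n"))  # "12345" is a made-up build id
```

Trimming the trailing slash of the base URL avoids the double slash that `$CHROMEDIR/$LATEST` produces in the shell version, which servers usually tolerate but is easy to get wrong.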

Automate it

To automate it we can add it to a cron job like so:

01      11      *       *       *       update-chromium

That runs every day at 11:01.

Monitoring Productivity Experiment

2010-10-20T00:00:00+00:00

For over a year now, I’ve been recording how much time I spend at the computer and how much of it is actually used for creative/productive activities.

By productive activity I mean that time spent in a text editor (emacs), terminal, Excel or a database client is likely to be more creative/productive than time spent on YouTube, Twitter, reading RSS feeds, IM chatting or replying to email. On average.

This is overly simplified, but the tools I’m using work especially well for this, including automatic data collection without the need for manual data entry.

Tracking Time

I’m using the RescueTime application, which tracks when there’s user activity in a particular computer application. I then copy the data into a Google Docs spreadsheet, keeping only a summary per week. RescueTime, like any other app, can have its hiccups, and I’ve noticed a couple of rare occasions when it was not tracking well, but overall it works well.

Data

Per week I collect the hours of total, productive and distracting time.

Besides productive and distracting, there’s also neutral time, which is something in between: for example, moving files around (in Finder, the OS X equivalent of Windows Explorer), a Google search, or even a data gap that I am not able to classify all go into the neutral time bucket.

thus, total = productive + distracting + neutral

I’ll look here at a full year (52 weeks’ worth of data).

google.spreadsheet <- function (key) {
  library(RCurl)

  tt <- getForm("https://spreadsheets.google.com/spreadsheet/pub", 
                hl ="en_GB",
                key = key, 
                single = "true", gid ="0", 
                output = "csv", 
                # ssl validation off: the option must be passed to curl to take effect
                .opts = list(ssl.verifypeer = FALSE,
                             followlocation = TRUE, verbose = TRUE)) 

  read.csv(textConnection(tt), header = TRUE)
}

# load the data
mydata = google.spreadsheet("0AnypY27pPCJydGNCcDhIVVRyZ1ZMWnBTbjBQbmJ0WVE")

# Create a new column with the proper date format
mydata$date = as.Date(mydata$date, format='%d/%m/%Y')

# include ggplot2
library(ggplot2)

How is the data distributed?

pl <- ggplot(data = mydata)
# viewport() comes from the grid package
library(grid)
# subplot viewport
subvp <- viewport(width=0.4, height=0.4, x=0.22, y=0.80)

his = pl + 
      geom_histogram(aes(x=total,y=..density..),alpha=0.2,binwidth=2) + 
      geom_density(aes(x=total)) + 
      opts(title = "Total")
qqp = pl + 
      geom_point(aes(sample=total), stat="qq") + labs(x=NULL, y=NULL) + 
      opts(title = "QQ")
print(his)
print(qqp, vp = subvp)

his = pl + 
      geom_histogram(aes(x=productive,y=..density..),alpha=0.2,binwidth=2) + 
      geom_density(aes(x=productive)) + opts(title="Productive")
qqp = pl + 
      geom_point(aes(sample=productive), stat="qq") + labs(x=NULL, y=NULL) + 
      opts(title = "QQ")

print(his)
print(qqp, vp = subvp)

his = pl + 
      geom_histogram(aes(x=distracting,y=..density..),alpha=0.2,binwidth=2) + 
      geom_density(aes(x=distracting)) + opts(title = "Distracting")
qqp = pl + 
      geom_point(aes(sample=distracting), stat="qq") + labs(x=NULL, y=NULL) + 
      opts(title = "QQ")

print(his)
print(qqp, vp = subvp)

With the exception of a couple of loose ends, we see that the data follows the normal distribution quite well, which allows for a few assumptions when analyzing it. We could even cut off those loose ends (by excluding data) for an even closer match to the normal distribution.

How much time spent at the computer?

(sum(mydata$total) / 24) / 365

[1] 0.2391553

Almost 1/4 (~24%) of the whole year in front of the computer. Wow!

How many hours per day?

ttest <- t.test(mydata$total, conf.level = 0.95)
print(ttest)
 
        One Sample t-test

data:  mydata$total 
t = 27.3738, df = 51, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0 
95 percent confidence interval:
 37.33372 43.24320 
sample estimates:
mean of x 
 40.28846

The 95% confidence interval for the weekly mean is [37.33, 43.24], which means that ~40 hours is a good estimate of the average weekly time.
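
As a sanity check, the interval can be rebuilt from the other numbers in the t-test output. The sketch below is in Python and uses only the printed mean and t statistic; the critical value 2.0076 (two-sided 95%, 51 degrees of freedom) is an assumption implied by the printed interval, not something computed in the post.

```python
def interval_from_t_test(mean, t_stat, t_crit):
    """Rebuild a confidence interval for the mean from one-sample t-test output."""
    std_err = mean / t_stat  # t = (mean - 0) / std_err when testing against 0
    half_width = t_crit * std_err
    return mean - half_width, mean + half_width

if __name__ == "__main__":
    low, high = interval_from_t_test(mean=40.28846, t_stat=27.3738, t_crit=2.0076)
    print("weekly hours, 95% interval:", round(low, 2), round(high, 2))
    print("average hours per day:", round(40.28846 / 7, 2))
```

The standard error comes out at roughly 1.47 hours, so the weekly mean is pinned down to about plus or minus 3 hours.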

So that’s close to 40 hours per week, almost 6 hours per day at the computer. And this is the average over the whole year, that is, it includes weekends, vacations, holidays, etc…

Note: during the (8h) work day we are not active at the computer 100% of the time. From my own data, RescueTime captures on average 45min of activity for a full hour in front of the computer without interruptions. So an 8h working day already gives only 6h of active computer time; if you then add meetings, breaks, occasional discussions, etc… that value goes lower.

Searching for Correlations

plotmatrix(mydata[2:4]) + geom_smooth(method="lm")

cor(mydata[2:4])        
                total productive distracting
total       1.0000000  0.8719531   0.6884407
productive  0.8719531  1.0000000   0.4027419
distracting 0.6884407  0.4027419   1.0000000

Total and productive time seem to be strongly correlated. What does it mean? There are 2 ways to look at it:

1. increasing productive time makes the total go up.
2. increasing total time makes the productive time go up.

So, 1. is obvious and not interesting, but could 2. be true?

Well, if we compare productive vs distracting, we see that productive (0.872) has a stronger correlation with total time than distracting (0.688). Increasing distracting time will always increase the total (in exactly the same way as productive time does, as in 1.), so the stronger correlation suggests that increasing the total is more likely to increase productive time than distracting time.
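
For reference, the numbers in the correlation matrix are Pearson correlation coefficients. A minimal Python sketch of that computation follows; the weekly hours in the usage part are made-up values for illustration, not the data from the spreadsheet.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

if __name__ == "__main__":
    # made-up weekly hours, for illustration only
    productive = [20, 25, 18, 30, 22]
    distracting = [8, 6, 9, 7, 10]
    neutral = [10, 11, 9, 12, 10]
    total = [p + d + n for p, d, n in zip(productive, distracting, neutral)]
    print("total vs productive:", round(pearson(total, productive), 3))
    print("total vs distracting:", round(pearson(total, distracting), 3))
```

Because the total is the sum of its parts, each part is correlated with the total by construction; the larger and more variable part tends to show the higher coefficient.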

Trends

ggplot(mydata, aes(x=date)) +  labs(x=NULL, y=NULL) + 
  opts(legend.position="bottom") +
  geom_line(aes(y = total, colour="total")) +
  geom_smooth(aes(y = total, colour = "total")) + 
  geom_line(aes(y = productive, colour="productive")) +
  geom_smooth(aes(y = productive, colour = "productive")) +
  geom_line(aes(y = distracting, colour="distracting")) +
  geom_smooth(aes(y = distracting, colour = "distracting"))

The big drop towards the end is a 2 week vacation, where I barely used a computer.

In the first half of the plot there is a drop in productive time, accompanied by an increase in distracting time.

It also shows that close to the end (last couple of months) there’s a tendency for an increase in all categories.

The Gear

This post was also made to try out the Org-mode Babel mode that I discovered recently, which allows for literate programming (mixing live/executable code and text in the same document).

This doc was written in (Aqua)Emacs using Org-mode, with R as the statistics toolbox, loaded with the nice ggplot2 graphics package. This allows for a very smooth workflow for creating this type of document and it works very well :)

How to download videolectures.net videos with VLC

2010-08-08T00:00:00+00:00

videolectures.net has very good content but no good way to download the videos (at least as of now).

And oftentimes you want to watch them offline, so here’s a way to download them if you have the VLC video player installed.

This of course works for all videos that are streamed from an mms:// address, not only the videolectures ones.

Btw, I found the same idea, but using mplayer instead, from bradfordcross here.

How to

Find the mms:// address for the video(from the web page source) and do:

$ /Applications/VLC.app/Contents/MacOS/VLC -I rc mms://velblod2.ijs.si:80/2009/pascal2/mlss09uk_cambridge/mackay_it/mlss09uk_mackay_it_01.wmv --sout ~/Desktop/information-theory.avi

If you are not on a Mac, then you need to update the paths.

Make a bash script from it

It’s annoying to write all of the above every time you want to download a video, so it is worth making a bash script from it.

Create a new file, with content:

#!/bin/bash
# getvideo.sh 
/Applications/VLC.app/Contents/MacOS/VLC -I rc "$1" --sout ~/Desktop/"$2".avi vlc://quit;

Note that I added vlc://quit; at the end; this makes the script exit when finished.

Make it executable and reachable from everywhere (the link target must be an absolute path, otherwise the link points to a file that does not exist in /usr/bin):

$ chmod 755 getvideo.sh 
$ sudo ln -s "$PWD/getvideo.sh" /usr/bin/getvideo

And use like so:

$ getvideo mms://velblod2.ijs.si:80/2009/pascal2/mlss09uk_cambridge/mackay_it/mlss09uk_mackay_it_01.wmv information-theory

Clojure and Selenium part ii - cov3

2010-05-22T00:00:00+00:00

I have been playing around with Selenium and Clojure for a while, and now that Selenium 2 is in beta I ended up making a little web crawler library called cov3.

It has 3 flavors of crawling:

the usual crawler: give it a url and it keeps going until it has visited all of the linked pages (that point to the same domain).
a sitemap crawler: give it a sitemap.xml and it visits the urls it finds in the sitemap.
a step crawler: give it a csv file with the list of urls (steps) to visit.

On each page the crawler visits, it executes a bit of JavaScript code that we can define as a validation. These validations are useful for testing your site in an automated way. Say, for example, you want to check:

if all pages contain meta tags
if all pages contain a title
if you have web analytics tracking in all pages
find out what links are broken
test your own javascript in an automated way in all pages.
etc..

It also allows crawling with either Firefox, Internet Explorer (when on Windows), Chrome or HtmlUnit (a GUI-less browser).
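
The "usual crawler" logic (follow links, stay on the same domain, run a validation on every page) can be sketched independently of Selenium. The Python sketch below is not cov3's implementation: it crawls a made-up in-memory site instead of a real browser, and `SITE`, `fetch_links` and `has_title` are hypothetical names for illustration.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(start_url, fetch_links, validate):
    """Breadth-first crawl of same-domain pages, running `validate` on each."""
    domain = urlparse(start_url).netloc
    seen, queue, results = {start_url}, deque([start_url]), {}
    while queue:
        url = queue.popleft()
        results[url] = validate(url)
        for href in fetch_links(url):
            link = urljoin(url, href)  # resolve relative links
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return results

# A made-up site standing in for pages loaded in a real browser.
SITE = {
    "http://example.com/": {"title": "Home", "links": ["/about", "http://other.org/x"]},
    "http://example.com/about": {"title": "", "links": ["/", "/contact"]},
    "http://example.com/contact": {"title": "Contact", "links": []},
}

def fetch_links(url):
    return SITE.get(url, {}).get("links", [])

def has_title(url):
    """Example validation: does the page have a non-empty title?"""
    return bool(SITE.get(url, {}).get("title"))

if __name__ == "__main__":
    for url, ok in crawl("http://example.com/", fetch_links, has_title).items():
        print(url, "OK" if ok else "missing title")
```

In cov3 the validation is a JavaScript snippet evaluated in the browser; here it is a plain function so the sketch runs without any browser installed.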

Usage

(require '[cov3 :as cov3])

;; then (:ff is short for firefox, use :hu for HtmlUnit, 
;; :ch for Chrome, and :ie for Internet Explorer)
(cov3/crawl :ff "http://al3xandr3.github.com/" '("document.title"))

;; or (10 is the sample size to pick from sitemap.xml)
(cov3/sitemap-crawl :ff "http://al3xandr3.github.com/sitemap.xml" "" "" 10 '("document.title"))

;; or (assuming you have a csv file with the steps to take, see more on documentation)
;; for example the line: http://al3xandr3.github.com/,"document.title",,
(cov3/step-crawl :ff "data/steps.csv")

It is available from GitHub: http://github.com/al3xandr3/cov3

jQuery Twitter 'mini' plugin

2010-04-10T00:00:00+00:00

Here’s a little jQuery plugin for displaying a twitter feed in a web page.

The goal was to put my latest ‘tweets’ on my blog, and also learn jQuery.

Ended up making a ‘mini’ jQuery plugin that can easily be added into any web page.

Demo:

For the code:

$(function() {
  $('#tw').click(function() {
    $('#tw').twitter({'user':'al3xandr3','count':2});
  });
});

How It Works

It makes an Ajax request to twitter that returns json data of the feed. That data is then read and injected into the selected html element(s).

See in:

$.ajax({
  url: "http://twitter.com/status/user_timeline/" + settings.user + 
       ".json?count="+ settings.count +"&callback=?",
  dataType: 'json',
  success: function (data) {
    $this.html(""); //clean previous html
    
    $.each(data, function (i, item) {
      
      //text
      $this.append("<p id=" + item.id + ">" + 
                   replaceURLWithHTMLLinks(item.text) + 
                   "</p>");

      //date
      if (typeof prettyDate(item.created_at) !== "undefined") {       
        $("<div><a style='font-size: 80%;' href='http://twitter.com/" +
          settings.user + "/status/" + item.id + "' target='_blank'>" +
          prettyDate(item.created_at) + "</a></div>").appendTo("#" + item.id);
      }
    });}
});

jQuery is a very nicely designed lib, simple and powerful. Some say it’s just like a functional programming Monad.

Full source code is available from GitHub: http://github.com/al3xandr3/jquery-twitter-plugin

AB testing tools in the Future

2010-03-16T00:00:00+00:00

A view on AB testing tools of the future.

How it works:

You plug the AB testing tool into your application and say: optimize page A for the measurable goal X (for example, downloads).
The tool by itself: creates a new UI variation -> tests it -> analyses the results -> makes it the default -> creates a new UI variation -> tests it -> etc… This goes on ad aeternum… Much like natural evolution, it keeps experimenting/mutating until it finds the UI that works best for the defined goals.

Details:

New UI variations do not need to be (and shouldn’t even be) 100% random. They should use smarter techniques like: genetic (and other search/optimization) algorithms + tried-out design heuristics + branding guidelines (avoid color A, use font B, etc.) + (sampled) user filtering + some amount of randomness + etc.
Knowledge Base: Build a database of the test results that collects knowledge of what worked and what didn’t (for a given context). Just as Pandora collects user input for building its recommendation system, this accumulated knowledge would serve as input for the task of generating new UI variations. Note: the amount of data is key; the more test results there are, the closer we get to covering all possible variations, and thus the closer to the best optimizations. With a large amount of tested results we would quickly get to the perfect UI rules.
Page Flow: The tool should optimize not only the page itself, but also the navigation between pages, customizing content depending on the flow. For example, forward the user to a different page depending on the keyword used in a search engine when arriving at the website.
Personalized UI: What works for user A might not work for B. A 16 year old likes different things than a 50 year old. Even for a single user, tastes change over time: winter vs summer, week vs weekend, working hours vs non-working hours, etc… So the perfect interface might need to change over time (? Don’t assume, experiment and see if it works…).
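
The loop described under "How it works" (create a variation, test it, keep it if it wins, repeat) can be sketched as a tiny hill-climbing search. This Python sketch rests on strong assumptions: a UI is reduced to a single made-up `button_size` property, and `conversion_rate` is a hypothetical stand-in for a real measured goal such as downloads.

```python
import random

def conversion_rate(variant):
    """Hypothetical measurable goal, best at button size 40 (made-up)."""
    return -(variant["button_size"] - 40) ** 2

def mutate(variant, rng):
    """Create a new UI variation by nudging one property."""
    return {"button_size": variant["button_size"] + rng.choice([-5, 5])}

def optimize(variant, rounds, rng):
    """Keep mutating the current default; adopt a variation only if it wins."""
    best, best_score = variant, conversion_rate(variant)
    for _ in range(rounds):
        candidate = mutate(best, rng)       # create a new UI variation
        score = conversion_rate(candidate)  # test it and analyse the result
        if score > best_score:              # make it the default if it wins
            best, best_score = candidate, score
    return best, best_score

if __name__ == "__main__":
    best, score = optimize({"button_size": 10}, rounds=200, rng=random.Random(1))
    print(best, score)
```

A real tool would replace the random nudge with the smarter generators listed under "Details" and would need a statistical test before declaring a winner, since real measurements are noisy.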

Automating todo tasks reports with org-mode

2010-03-08T00:00:00+00:00

Here’s the geek automation of the week: it helps create reports from my TODO task list when using the amazing Emacs org-mode (see here what org-mode is all about).

(simplified) Work Flow

I get a request, add it into my todo list queue, marking it as a TODO item.
Work, work, work, guided by the todo listed tasks, balancing priority and effort and (..add your own reason here..).
When finished, mark an item DONE.
Generate a report every week with the done tasks.

Generating the Report

(I use this setup on a Mac; with some adaptations it should also work on Linux and Windows)

Every week I then generate a report of the DONE tasks by running:

# file: get-work-done.sh 
# run: sh get-work-done.sh

# Uses emacs to extract the DONE items from work.org, generating a work-done.csv
/Applications/Aquamacs.app/Contents/MacOS/Aquamacs -batch -l ~/.emacs -eval '(org-batch-agenda-csv "+TODO=\"DONE\"" org-agenda-files (quote ("/.../work.org")))' > work-done.csv

# Applies desired report formatting to the exported work-done.csv, generating work.csv
ruby format-report.rb work-done.csv

# Clean up the originally exported file
rm work-done.csv

# Opens the final file in the default .csv handler, typically Excel.
open work.csv

See the “#” comments to understand what each step does.

Then I use format-report.rb, which applies the final formatting to the report, for example: add my own header, add/remove columns, dates, change names, calculate values, etc… See the example:

# file: format-report.rb

flines = File.open(ARGV[0]).readlines

column_map = { 
  "from_name1"  => "to_name1", 
  "from_name2"  => "to_name2",  
}

File.open( "work.csv","w+") do |fl|  
  fl.puts "header1,header2,header3,header4"
  flines.each do |l|
    # drop the trailing newline before splitting into columns
    a = l.chomp.split(",")

    # Time, mapping-defined-in-column_map, original-column-2, original-column-3
    fl.puts([Time.now.strftime("%m/%d/%Y"),
             (column_map[a[1]] || a[1]), a[2], a[3]].join(","))
  end
end

And voilà, I run this and an Excel (my default .csv app) sheet opens up with the report of the week.

Probability simulation of basketball throws

2009-08-27T00:00:00+00:00

A little probability simulation from the book Resampling: The New Statistics, using Clojure and Incanter.

Example 7-3: What is the probability that a basketball player will score three or more baskets in five shots from a spot 30 feet from the basket, if on the average she succeeds with 25 percent of her shots from that spot?

Three or More Successful Basketball Shots in Five Attempts, that is, a Two-Outcome Sampling with Unequally-Likely Outcomes, with Replacement, Binomial Experiment.

Let’s start with the Resampling book’s code solution, and then implement it using Clojure and Incanter.

REPEAT 100000
  GENERATE 5 1,4 a
  COUNT a =1 b 
  SCORE b z
END
COUNT z >= 3 k
DIVIDE k 100000 kk
PRINT kk

Now onto the clojure solution.

Starting simple, 1 throw

Before going for the full thing, I first tried out a simulation of just 1 throw and visualized it.

It simulates a basketball throw 100 thousand times with the probabilities described above.

The category-count function is a helper to count :miss’s and :hit’s, but it’s not absolutely needed.

(use '(incanter core stats charts io))

(defn category-count
  "counts the categories from a given sequence, ex:
   (category-count '(:hit :hit :miss)) => {:miss 1, :hit 2}"
  [aseq]
  (into {} (let [dis (distinct aseq)]
             (for [item dis]
               {item (count (filter #(= item %) aseq))}))))

(def one-throw
  (for [_ (range 100000)]  
    ; does not matter if replacement is true or false
    (sample [:hit :miss :miss :miss] :size 1 :replacement false)))

(def throws-count (category-count one-throw))
(view (bar-chart (keys throws-count) (vals throws-count)))

The results make sense: there is a 1/4 (25%) probability of making the basket, so the simulation seems to be working correctly.

With 5 Throws

Same logic: it’s the sample function from Incanter doing all the magic of randomly picking whether each throw is a :miss or a :hit. We then test whether the number of hits is >= 3, mark that run as :ok, and finally compute the share of :ok’s in the total simulation.

(defn five-throws
  []
  (category-count 
    (for [_ (range 100000)]  
      (let [smp (sample [:hit :miss :miss :miss] :size 5 :replacement true)
            catg (category-count smp)
            hits (:hit catg)]
        (if (and (not (nil? hits)) (>= hits 3))
          :ok
          :nok)))))

(double (/ (:ok (five-throws)) 100000))

The result is 0.10283, so there’s only a ~10% chance of that basketball player making 3 or more baskets in a sequence of 5 throws.
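
The simulated value can be cross-checked against the exact binomial probability, P(X >= 3) with n = 5 and p = 0.25, which works out to 106/1024, about 0.1035. A short Python sketch of that calculation (the function name is illustrative):

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): sum the exact terms from k to n."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

if __name__ == "__main__":
    print(prob_at_least(3, 5, 0.25))
```

The simulation result of 0.10283 sits within normal sampling noise of this exact value for 100 thousand runs.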

Clojure test-is results in twitter

2009-05-15T00:00:00+00:00

Here’s how you can have your test-is results posted on twitter automatically as they run.

This is useful, for example, if you have regression tests that run automatically on a remote machine once a week and want a good way to see the results.

Start by getting jtwitter, put it in your classpath and create:

(import 'winterwell.jtwitter.Twitter)

(defn twitter-update [the_message]
  (let [twitter (new Twitter "username" "password")]
    (.updateStatus twitter the_message)))

Update “username” and “password” with your twitter login.

Then hook it up to the test-is library (the test library in Clojure): just override the report summary method, which by default prints out the summary of the executed tests.

Before:

(defmethod report :summary [m]
  (with-test-out
    (println "\nRan" (:test m) "tests containing"
    (+ (:pass m) (:fail m) (:error m)) "assertions.")
    (println (:fail m) "failures," (:error m) "errors.")))

After:

(defmethod report :summary [m]
  (with-test-out
    (twitter-update (str "[coverager] " (:fail m) " failures, " (:error m) " errors."))
    (println "\nRan" (:test m) "tests containing"
    (+ (:pass m) (:fail m) (:error m)) "assertions.")
    (println (:fail m) "failures," (:error m) "errors.")))

Note that I just added a line calling the twitter-update function.

And that’s it: now every time you run your tests, you will have the failures and errors twittered.

I’ve created a 2nd account on twitter where I post these automated messages. I have my Clojure tests (regression tests) running by themselves and I just check twitter to see if all is good.

This little twitter-update function is very easy to re-use for other alerts and automations.

Clojure and Selenium

2009-04-24T00:00:00+00:00

I needed a kind of crawler to go around a list of pages, invoke some javascript(tests) and collect the output.

Curl or a regular HTTP lib doesn’t do the trick because I need to run JavaScript on each requested page. For that I can use Selenium, a great framework for web testing that drives a real browser and thus lets us run JavaScript.

Selenium can be scripted from Java which matches very well with my wish to learn Clojure :)

Solution

What I implemented is not really a crawler, in the sense that it does not go around automatically following all the links it finds; it actually gets the list of links to check from the site’s sitemap.xml. But it is not that hard to use this as a base for a crawler.

As some sitemap.xml files are huge, I also added a little pick-a-sample function that randomly selects only a subset of the sitemap.
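
The sampling idea is simple: keep each url independently with a probability equal to the requested percentage, so the sample size is only approximately that percentage of the list. A minimal Python sketch of the same idea (the urls in the usage part are made up):

```python
import random

def pick_a_sample(percentage, items, rng=random):
    """Keep each item independently with probability percentage/100."""
    return [item for item in items if rng.random() < percentage / 100]

if __name__ == "__main__":
    urls = ["http://example.com/page%d" % i for i in range(1000)]  # made-up urls
    sample = pick_a_sample(1, urls, random.Random(7))
    print(len(sample), "of", len(urls), "urls picked")
```

Because every item is decided separately, a 1% sample of 1000 urls will usually contain around 10 urls, but not always exactly 10.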

Code

I’m in the process of learning Clojure, so probably a lot of things could be improved.

For Selenium, we first need to start the server, then the client, and then use the client to browse the pages. As it is not very elegant to have a “start server” and “start client” call at the top of the script and a “stop client” and “stop server” call at the end of the script, I’ve wrapped those in a macro (one of the major strengths of Lisp-like languages).
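
For comparison, languages without macros usually solve the same start/stop wrapping with a scoped construct. Below is a Python sketch of the idea using a context manager; `FakeService` is a hypothetical stand-in for the Selenium server and client, so the sketch runs without Selenium installed.

```python
from contextlib import contextmanager

class FakeService:
    """Hypothetical stand-in for the Selenium server or browser client."""
    def __init__(self, name, log):
        self.name, self.log = name, log
    def start(self):
        self.log.append("start " + self.name)
    def stop(self):
        self.log.append("stop " + self.name)

@contextmanager
def with_selenium(log):
    """Start server then browser, hand over the browser, stop both afterwards."""
    server = FakeService("server", log)
    server.start()
    try:
        browser = FakeService("browser", log)
        browser.start()
        try:
            yield browser
        finally:
            browser.stop()  # runs even if the body raises
    finally:
        server.stop()

if __name__ == "__main__":
    log = []
    with with_selenium(log) as browser:
        log.append("browse with " + browser.name)
    print(log)
```

As with the macro, the calling code only contains the browsing steps; the set-up and tear-down order lives in one place.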

The whole thing goes like this:

process-sitemap receives a sitemap, transforms it into a map (with xml-to-zip), collects the url links in it, then picks a sample from them (with pick-a-sample) and calls check-pages with them.

check-pages gets a list of urls. It starts by using the macro, obtains a-browser from it, then iterates over the list of urls, calling check-a-page on each url (a-url). Note that at this point the standard output is redirected to a file, so I can log the results from check-a-page.

check-a-page gets a-browser and a-url, so you can guess what it will do :) It opens that url in the browser, calls the javascript, and prints the return value of the js call to standard output.

I hope Google does not mind me using their site as an example. But do not run this on Google’s site; it’s just an example, use it on your own site!

For this to run you will need to have a bunch of jar libs in your classpath. This is what my lib folder looks like:

lib/
  clojure-contrib.jar
  clojure.jar
  jline-0.9.94.jar
  selenium-java-client-driver.jar
  selenium-server.jar

I called this app coverager.

Code:

;;file: coverager.clj
(ns coverager
  (:import (com.thoughtworks.selenium DefaultSelenium)
    (org.openqa.selenium.server SeleniumServer)
      java.util.Date
      (java.io FileWriter)
      (java.text SimpleDateFormat))
  (:use clojure.contrib.zip-filter.xml)
  (:require [clojure.zip :as zip]
            [clojure.xml :as xml]))

(defmacro with-selenium
  [browser & body]
  `(let [server# (new SeleniumServer)]
     (.start server#)
     (try
       (let [~browser 
             (new DefaultSelenium "localhost", 4444, "*firefox", "http://www.google.com/")]
         (.start ~browser)
         (try
           (.setTimeout ~browser "100000")
           ~@body
           ; always stop the browser, even if the body throws
           (finally (.stop ~browser))))
       ; always stop the server, even if the browser fails
       (finally (.stop server#)))))

(def *js-eval* "this.browserbot.getCurrentWindow().document.title;")

(defn check-a-page [a-browser a-url] 
  (try 
    (.open a-browser a-url)
    (Thread/sleep 3000) ; make a little timeout, to avoid overloading server
    (println (str a-url "," (.getEval a-browser *js-eval*)))
    (catch Exception e 
      (println (str a-url "," e)))))

(defn check-pages [url-list]
  (with-selenium browser
    (binding [*out* (FileWriter. 
         (str "output/log_" (.format (SimpleDateFormat. "yyyy-MM-dd") (Date.)) ".csv"))]
      (doseq [a-url url-list]
        (check-a-page browser a-url)))))

(defn xml-to-zip
  "read xml url into a tree"
  [url]
  (zip/xml-zip (xml/parse url)))

(defn pick-a-sample
  "picks a subset (a-)percentage of the total"
  [a-percentage a-list]
  ; keep each item independently with probability a-percentage/100
  (filter (fn [_] (> (rand) (- 1 (/ a-percentage 100)))) a-list))

(defn process-sitemap [sitemap-url]
  (let [u-list (xml-> (xml-to-zip sitemap-url) :url :loc text)]
    (check-pages (pick-a-sample 1 u-list))))

(def *sitemap* "http://www.google.com/sitemap.xml")

;use: (process-sitemap *sitemap*)

And of course tests for it:

;;file: coverager_test.clj
(ns coverager_test
  (:use clojure.contrib.test-is)
  (:use coverager)
  (:use clojure.contrib.zip-filter.xml)
  (:require [clojure.zip :as zip]
            [clojure.xml :as xml]))

(deftest browse-page
  (with-selenium abrowser  
    (.open abrowser "http://www.google.com/a/")
    (is (.startsWith (.getTitle abrowser) "Google Apps"))))

(def abit "<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns='http://www.sitemaps.org/schemas/sitemap/0.9'>
 <url>
  <loc>http://www.google.com/</loc>
  <lastmod>2009-04-03</lastmod>
  <priority>0.5000</priority>
 </url>
 <url>
  <loc>http://www.google.com/a</loc>
  <lastmod>2009-04-03</lastmod>
  <priority>0.5000</priority>
 </url>
</urlset>
")

(deftest xml-process
  (let [res (xml-to-zip (org.xml.sax.InputSource. (java.io.StringReader. abit)))]
    (let [lis (xml-> res :url :loc text)]
      (is (= (first lis) "http://www.google.com/"))
      (is (= (last lis) "http://www.google.com/a")))))

(deftest on-picking-sample
  (let [the-sample (pick-a-sample 10 '(0 1 2 3 4 5 6 7 8 9))]
    ;not completely guaranteed to take only 1 item,
    ;it should in most cases, but more important is
    ;picking a small random subset from the list,
    ;so fewer than 3 items is a reasonable test
    (is (< (count the-sample) 3))))

(defn run-them []
  (run-tests 'coverager_test))

Take away

Clojure is great! It’s my opinion that in the Lisp family of languages the code is more elegant and visually cleaner than in the C family.

I don’t care much for working directly with the Java language, but working on the JVM with other languages like JRuby and Clojure, and harnessing the vast amount of Java libs and infrastructure out there, is a MAJOR advantage.

I suspect I will be spending more time with Clojure in the future :)

Lem-E-Tweakit and Logic programming

2009-02-02T00:00:00+00:00

While watching the SICP lectures 8a & 8b, one thing I realized is that the logic programming they mention seems very similar to the kind of things we use SQL for, just better… i.e. a lot more flexible.

So why do we use SQL after all?

After a bit of googling I found this good article: http://search.cpan.org/dist/AI-Prolog/lib/AI/Prolog/Article.pod, and one of its references says:

“…So if Prolog(read AI) and SQL(DB) are so similar, Why is one so successful commercially and the other deemed a complete failure in terms of scalability?…”

Relational Databases implement powerful techniques to improve performance:

Indexing
Hashing
Reordering goals to reduce backtracking

Whereas Prolog-based systems have very few such techniques.

What is SQL missing?

One of the most powerful features of Prolog is recursion.

SQL does not have recursion built into it. This is a severe handicap. However, there is a way to overcome this problem by invoking multiple SQL queries from a host language like C or Java. SQL3 has begun supporting recursion.
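To make the recursion point concrete, here is a minimal sketch of a recursive query, assuming an SQL engine that supports recursive common table expressions (SQLite does, through Python's built-in sqlite3 module). The `parent_of` table and the names in it are made up for illustration; the query plays the role of the classic Prolog ancestor rule.

```python
import sqlite3

def ancestors(conn, person):
    """Return every ancestor of `person` with a single recursive SQL query."""
    rows = conn.execute(
        """
        WITH RECURSIVE ancestor(name) AS (
            -- base case: the direct parents
            SELECT parent FROM parent_of WHERE child = ?
            UNION
            -- recursive case: parents of the ancestors found so far
            SELECT p.parent
            FROM parent_of AS p
            JOIN ancestor AS a ON p.child = a.name
        )
        SELECT name FROM ancestor ORDER BY name
        """,
        (person,),
    ).fetchall()
    return [name for (name,) in rows]

# Hypothetical data: alice -> bob -> carol -> dave
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parent_of (parent TEXT, child TEXT)")
conn.executemany(
    "INSERT INTO parent_of VALUES (?, ?)",
    [("alice", "bob"), ("bob", "carol"), ("carol", "dave")],
)
print(ancestors(conn, "dave"))
```

Without the recursive part, the host language would have to loop and issue one query per generation, which is the workaround the quote describes.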

Why my keyboard has a QWERTY layout?

2009-01-21T00:00:00+00:00

The most common keyboard layout in use today is called QWERTY. It takes its name from the first six characters at the far left of the keyboard’s top row of letters.

Why QWERTY?

from Wikipedia:

The QWERTY layout was introduced in the 1860s, being used on the first commercially-successful typewriter, the machine invented by Christopher Sholes. The QWERTY layout was designed so that successive keystrokes would alternate between sides of the keyboard so as to avoid jams. Improvements in typewriter design made key jams less of a problem.

So the mechanical issues of the original typewriters were the main reason for this layout design.

A second popular layout is called Dvorak. It was designed in the 1930s and tries to address another problem:

“…the introduction of the electric typewriter in the 1930s made typist fatigue more of a problem, leading to increased interest in the Dvorak layout…”

But even the Dvorak layout is already quite old, so is typist fatigue the same now as it was then? Do we write, for example, the same words as in the 1930s? I think not…

Is there a better layout?

Take a look at the texts you write every day, and imagine if changing the keyboard would make it more practical for you.

For example, for this text I had to write the word WHY a lot ;) so if I had whyrtu instead of qwerty on my keyboard, it would be an improvement… right? Maybe yes, but there are other things to take into account, for example:

counting ALL of the most used words (not only the why).
key combinations that minimize vertical finger movement, because jumping from the 1st to the 3rd row is less efficient than jumping from the 1st to the 2nd row.
key combinations that minimize horizontal finger movement.
frequent use of the pinky should be minimized; this is related to the long horizontal and vertical movements above.
etc… (many other heuristics)

And then try out several keyboard layouts and variations.
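As a rough illustration of how such heuristics could be turned into a score for comparing layouts, here is a small sketch. The weights, the `layout_cost` function and the `WHY_HOME` variant are all assumptions made up for this example, not the method used in the experiment linked below; it only looks at row jumps, column distance between consecutive keys, and keys in the outermost (pinky) columns.

```python
QWERTY = ["qwertyuiop", "asdfghjkl;", "zxcvbnm,./"]

# Hypothetical variant: w/g and y/j swapped so that "why" sits on the home row.
WHY_HOME = ["qgertjuiop", "asdfwhykl;", "zxcvbnm,./"]

def key_positions(rows):
    """Map each character to its (row, column) position on the layout."""
    return {ch: (r, c) for r, row in enumerate(rows) for c, ch in enumerate(row)}

def layout_cost(rows, text, pinky_penalty=1.0):
    """Crude typing cost of `text` on a layout; lower is better."""
    pos = key_positions(rows)
    pinky_columns = {0, len(rows[0]) - 1}
    # characters that are not on the layout (spaces, digits...) are ignored
    keys = [ch for ch in text.lower() if ch in pos]
    cost = 0.0
    for prev, cur in zip(keys, keys[1:]):
        (r1, c1), (r2, c2) = pos[prev], pos[cur]
        cost += abs(r1 - r2)          # vertical jump between rows
        cost += 0.5 * abs(c1 - c2)    # horizontal distance, weighted lower
    # every key typed in an outermost column counts as a pinky stroke
    cost += pinky_penalty * sum(1 for ch in keys if pos[ch][1] in pinky_columns)
    return cost

print(layout_cost(QWERTY, "why"), layout_cost(WHY_HOME, "why"))
```

Feeding it a large sample of your own text instead of a single word would give a first, very approximate ranking of candidate layouts.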

Doing this by hand, like Dvorak and Sholes did, is a pain. Luckily we have computers now, and as a matter of fact someone already played around with this problem (which is what made me think about this whole subject in the first place).

See here for the experiment: http://klausler.com/evolved.html, and also the final keyboard layout: http://klausler.com/evolved.pdf

Your Own keyboard

I’m not sure that there is ONE perfect keyboard for everyone, because:

different people use different words.
different languages use different words.
the words of a 15-year-old are different from the words of a 70-year-old.
the words you type during working hours are different from the ones you type outside of work…
etc…

So even for you, an optimal keyboard layout would probably change over time.

But imagine a future intelligent computer that is constantly analyzing what you type and can adapt itself to give you your own optimal layout. Not too frequently, of course; you don’t want your keyboard changing every day. But maybe every 3 years is not so impossible to imagine…

Keyboards of the future should come with blank keys that can be personalized with the character you want.

How would this work if someone changes computers? The keyboard profile could be fetched from the internet when you log in to the computer.

Another idea is to have full words on keys, like your top 5 most used words.

Touch screens: with a touch screen there’s a whole new level of layout possibilities. We could make the more frequent keys bigger than the others, for example, or play with different keyboard shapes that adapt to each person’s real hand size, etc…

Photo organizer

2008-11-25T00:00:00+00:00

Place this script file in a folder full of pictures, run it, and it will organize the pictures into folders by the day they were taken.

Ruby Version:

Run:

$ ruby picsort.rb

Code:

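#!/usr/bin/env ruby
# filename: picsort.rb

require 'fileutils'
require 'exifr'

def picDate(file)
  begin
    # EXIFR::TIFF.new raises on JPEG files, so pick the reader by extension
    tiff = [".TIF", ".TIFF"].include? File.extname(file).upcase
    ex = tiff ? EXIFR::TIFF.new(file) : EXIFR::JPEG.new(file)
    # fall back to the file's modification time when there is no EXIF date
    (ex.date_time || File.mtime(file)).strftime "%Y%m%d"
  rescue
    File.mtime(file).strftime "%Y%m%d"
  end
end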

def isPic?(file)
  [".JPG",".JPEG",".PNG",".AVI",".WAV",".NEF",".MOV",".TIFF"]
    .include? File.extname(file).upcase
end

print "Creating dirs"
Dir.foreach(".") do |f|
  dt = picDate f 
  Dir.mkdir dt and print '.' if isPic? f unless File.directory? dt
end

print "\nMoving pics"
Dir.foreach(".") do |f| 
  FileUtils.mv(f, picDate(f)+'/'+f) and print '.' if isPic? f
end
puts

Python Version:

Run:

$ python picsort.py

Code:

# filename: picsort.py
from PIL import Image
from PIL.ExifTags import TAGS
import os
import time
from datetime import datetime

def get_exif(path):
    ret = {}
    i = Image.open(path)
    info = i._getexif()
    for tag, value in info.items():
        decoded = TAGS.get(tag, tag)
        ret[decoded] = value
    return ret

def picDate (path):
    res = ""
    try:
        res = datetime.strptime(get_exif(path)['DateTimeOriginal'], "%Y:%m:%d %H:%M:%S")
    except:
        res = ""
    return res

def isPic (path):
    file, ext = os.path.splitext(path)
    return (ext.upper() in [".JPG",".JPEG",".PNG",".NEF",".TIFF"])

import glob
for file in glob.glob('*'):
    if isPic(file):
        # read the EXIF date once and reuse it
        date = picDate(file)
        if date != "":
            dir = date.strftime("%Y-%m-%d")
            if not os.path.exists(dir):
                os.mkdir(dir)
            os.rename(file, dir + '/' + file)

Stock Price Alert

2008-07-31T00:00:00+00:00

Waiting for a stock price to rise to a certain value to sell it? Or waiting for the price to drop to a certain value to buy it? But don’t want to be checking it every day? Here’s a little automation that sends an alert to your email inbox.

This script will send you an email when the stock price is outside the defined thresholds.

To set it up to run every day, use the Windows Task Scheduler, or cron on Linux and Mac machines.

Tip: on Windows set the file extension to .rbw (which means windowed-mode script), so that no console appears when the scheduled task runs. Beware that if there is an error you won’t see it either.

Code:

class Stock
  require 'net/http'
  def self.fetch(*symbols)
    Hash[*(symbols.collect { |symbol|
      [symbol, Hash[*(Net::HTTP.get('download.finance.yahoo.com','/d?f=nl1&s='+symbol).chop.split(',').unshift("Name").insert(2,"Price"))]]
    }.flatten)]
  end
end
# puts Stock::fetch("MSFT").inspect 

class Mail
  require 'time'
  require 'net/smtp'
  def self.send(mail)
    msg = "Subject: #{mail[:subject]}\n\n#{mail[:body]}"
    smtp = Net::SMTP.new 'smtp.gmail.com', 587
    smtp.enable_starttls
    smtp.start('gmail.com', mail[:from], mail[:password], :login) do
      smtp.send_message(msg, mail[:from], mail[:to])
    end
  end
end

stock = Stock::fetch("MSFT")["MSFT"]["Price"].to_f
Mail::send({from:     "batatas123@gmail.com", 
            to:       "batatas123@gmail.com", 
            password: "*********", 
            subject:  "MSFT stock",
            body:     stock}) if stock <= 26 or stock > 28
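The parsing and threshold logic can be tried without a network connection or mail account. The sketch below is an illustration in Python, assuming the quote service answers with one CSV line of the form `"Name",price` (which is what the Ruby code above expects); `parse_quote` and `should_alert` are hypothetical helper names, and the thresholds mirror the `stock <= 26 or stock > 28` condition.

```python
import csv

def parse_quote(line):
    """Parse a CSV line of the form '"Name",price' into (name, price)."""
    # csv.reader copes with commas inside the quoted company name
    name, price = next(csv.reader([line.strip()]))
    return name, float(price)

def should_alert(price, low, high):
    """True when the price is at or below `low`, or above `high`."""
    return price <= low or price > high

name, price = parse_quote('"Microsoft Corporation",27.5\r\n')
print(name, price, should_alert(price, low=26, high=28))
```

Using a real CSV parser instead of a plain split on commas avoids breaking on company names that contain a comma.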

Ebay Misspells Search

2008-06-27T00:00:00+00:00

A small Ruby script that searches eBay. It is able to:

introduce some misspellings in order to find those misspelled items that might be a great bargain (because nobody else is able to find them).
search in more than one eBay store.

It prints the search results to the console. Of course, because it is in Ruby, it is easy to modify and play around with :)

It was pulled together quick and dirty, but I’m posting it here in case someone else finds it useful.

Use like so:

for multiple keyword search:

> ruby ebayfind.rb "Ibanez guitar"

for single keyword search:

> ruby ebayfind.rb Ibanez

Code:

require 'net/http'
require 'rexml/document'
require 'active_support'


class Misspell

 def self.a(a_text="word", options=["skip_letter", "double_letters",
   "reverse_letters","skip_spaces","missed_key","inserted_key" ])

   ## Sets up the params
   params = Hash.new
   params["user_input"] = a_text
   options.each do |opt|
     params[opt] = opt
   end

    #executes a call    
    res = Net::HTTP.post_form(URI.parse('http://tools.seobook.com/spelling/keywords-typos.cgi'), params)

    #cleans and formats results    
    res.body.gsub("\n", ',').scan(/<textarea rows[^>]*(.*)<\/textarea>/).flatten.to_s.gsub(">", "").split(',')
 end
end



if ARGV.length == 0
 puts "#{$0}: You must enter at least one argument."
 exit
end

search_str = [ARGV[0]]

puts "Finding: #{search_str}"
output = ""

puts "Creating the misspells..."
search_str << Misspell.a(search_str, ["skip_letter", "double_letters", "reverse_letters","skip_spaces","missed_key","inserted_key" ])
search_str.flatten!
puts "Search item with added misspells: #{search_str.join(",")}"

#ebay_stores = {"1"=>"US", "3"=>"GB", "77"=>"DE", "71"=>"FR", "186"=>"ES", "146"=>"NL"}
ebay_stores = {"3"=>"GB", "77"=>"DE", "71"=>"FR", "186"=>"ES", "146"=>"NL"}

# Iterate through each eBay store
ebay_stores.each_key do |siteid|

  # Iterate through each search term (the original plus its misspells)
  search_str.each do |query_string|

    # Put together an eBay parameter string    
    ebay_params = {
     'callname' =>'FindItemsAdvanced',
     'appid'                       =>'TODO:_YOUR_OWN_EBAY_API_ID',
     'version'                     =>'553',
     'responseencoding'            =>'XML',
     'siteid'                      =>siteid,
     'MessageID                   '=>'',
     'BidCountMax                 '=>'',
     'BidCountMin                 '=>'',
     'CategoryHistogramMaxChildren'=>'',
     'CategoryHistogramMaxParents '=>'',
     'CategoryID                  '=>'',
     'CharityID                   '=>'',
     'Condition                   '=>'',
     'Currency                    '=>'',
     'DescriptionSearch           '=>'',
     'EndTimeFrom                 '=>'',
     'EndTimeTo                   '=>'',
     'ExcludeFlag                 '=>'',
     'FeedbackScoreMax            '=>'',
     'FeedbackScoreMin            '=>'',
     'GroupMaxEntries             '=>'',
     'GroupsMax                   '=>'',
     'IncludeSelector             '=>'',
     'ItemsAvailableTo            '=>'',
     'ItemsLocatedIn              '=>'',
     'ItemSort                    '=>'',
     'ItemType                    '=>'AllItemTypes',
     'MaxDistance                 '=>'',
     'MaxEntries                  '=>'',
     'ModTimeFrom                 '=>'',
     'PageNumber                  '=>'',
     'PaymentMethod               '=>'PayPal',
     'PostalCode                  '=>'',
     'PreferredLocation           '=>'',
     'PriceMax                    '=>'',
     'PriceMin                    '=>'',
     'ProductID                   '=>'',
     'Quantity                    '=>'',
     'QuantityOperator            '=>'',
     'QueryKeywords               '=> (query_string.gsub(' ', '%20')),
     'SearchFlag                  '=>'',
     'SellerBusinessType          '=>'',
     'SellerID                    '=>'',
     'SellerIDExclude             '=>'',
     'SortOrder                   '=>'',
     'StoreName                   '=>'',
     'StoreSearch                 '=>''
   }.map { |key,value| "#{key.strip}=#{value}" unless value.empty? }.join("&").squeeze('&')

    # Ask eBay what it knows about our query_string    
    ebay_response = Net::HTTP.get_response('open.api.ebay.com', '/shopping?' << ebay_params)

    xml = REXML::Document.new(ebay_response.body)

    # Get basic information    
    response3 =  Hash.from_xml(xml.to_s)

   xml.root.elements.each("/FindItemsAdvancedResponse/SearchResult/ItemArray/Item") do |element|
     item =  Hash.from_xml(element.to_s)
     puts ""
     puts ">> Searching for: #{query_string}, in #{ebay_stores[siteid]}, got #{response3["FindItemsAdvancedResponse"]["TotalItems"]} results"
     puts item['Item']['Title']
     puts item['Item']['EndTime']
     puts item['Item']['ConvertedCurrentPrice']
     puts item['Item']['GalleryURL']
     puts item['Item']['ListingType']
     puts item['Item']['Condition']
     puts item['Item']['ViewItemURLForNaturalSearch']
   end
 end
end
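The script depends on an external web tool to produce the misspellings. As a rough local alternative, here is a sketch in Python that generates three of the variant types named in the options above (skip_letter, double_letters, reverse_letters). The `misspellings` function is hypothetical, and the real tool may well produce a different set of variants.

```python
def misspellings(word):
    """Generate simple typo variants of `word`, sorted alphabetically."""
    variants = set()
    for i in range(len(word)):
        # skip_letter: drop the letter at position i
        variants.add(word[:i] + word[i + 1:])
        # double_letters: type the letter at position i twice
        variants.add(word[:i] + word[i] * 2 + word[i + 1:])
    for i in range(len(word) - 1):
        # reverse_letters: swap two neighbouring letters
        variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    # do not return the original word or an empty string
    variants.discard(word)
    variants.discard("")
    return sorted(variants)

print(misspellings("ibanez"))
```

Each variant can then be used as an extra search term, the same way the script appends the tool's results to `search_str`.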

Funds 'R' US

2008-02-12T00:00:00+00:00

Is it only me, or is it a bit annoying that every year around Christmas, birthdays, etc., you are kinda forced to go and find some gift, and more often than not end up getting useless stuff…

It’s still nice to offer little things anyway, and a typical Valentine’s Day gift should not be changed I guess, but from a pragmatic point of view, offering yet another pair of pyjamas is just… waste…

A common déjà vu? “It’s a week to Christmas, I’m at the shopping mall already, I need to buy Andrew a gift, what can I get him?? Let’s check the bookstore. What could he like… maybe this car collection book… or this landscape photo album…”

How about we offer Andrew something that he really wants/needs? I know that wishlists already exist; Amazon, Google, etc. have them, and they are very nice. You can look at Andrew’s wish list, see he has a book there he really wants, cool, let’s get that. But how about big things? Things that you cannot offer to Andrew by yourself.

Like, Andrew really wants a piano. I cannot offer Andrew a piano, but maybe I can contribute to Andrew’s piano, kinda like buying a couple of piano keys…

That’s where Funds’R’US comes in. Andrew would go to Funds’R’US and create a piano fund for himself. When it’s Christmas time I can go to Funds’R’US, check Andrew’s funds, and find out that he really wants a piano, so I can quickly, online, without needing to go look for random stuff in shops, give him something useful that will make Andrew really happy. So I would deposit some money into Andrew’s piano fund, and then Funds’R’US could even send a personalized postcard to Andrew with a picture of a piano with a fraction of the picture selected, saying: Merry Christmas, Alex. After enough Christmases and birthdays pass by, Andrew will eventually get the money for a full piano, instead of a pile of not so useful stuff.

Discussing this idea with a friend of mine, Safin Ahmed (the zen master of (crazy?) ideas :) ), he suggested extending it a bit further: Funds’R’US could support more than personal funds, it could also have funds for organizations. Like UNICEF wanting to offer a health care center to a city in need, or your own town’s homeless center supporting 5 more people next winter, etc…

Big Brother Google

2008-01-18T00:00:00+00:00

(A)Typical Day? It’s morning, you get to your computer, go check Gmail, ah, there are some new photos on web Picasa from a friend, let’s check them out…

Then, what’s new today? Let’s check some RSS feeds in Google Reader, and by the way, how is the real world doing? Let’s check Google News. Hmm, this is interesting, let’s do some browsing on it. Open Google search…

search, search, search …

Lunch time, check Gmail again, then the Gtalk inside the browser pops up with a friend asking how you’re doing, and you chat a little.

Let’s also take a peek at how the stock options are doing on Google Finance.

search, search, search …

It’s late afternoon, you have a dinner scheduled and you wonder: exactly where is that restaurant’s street? Let’s check Google Maps (or Google Earth for an even cooler experience). You find out exactly where the restaurant is, and even add a placemark…

Get back home from dinner, download your digital camera pictures onto the computer, upload them to web Picasa, make a blog post (on Google Blogger) showing a picture from that night, adding some comments on where you were, what you did, and who you were with.

Check your mail again, and your friend sent you a YouTube video, let’s check that out, and maybe 1, 2, 3… 10 more videos…

Besides all these you might also use Google for making your home page, sharing documents, sharing a calendar, Google Analytics for collecting your web site stats, Google Desktop search for fast searches on your computer, Google Shopping, Google Images, etc., etc…

So overall, how many Google applications are you using? How much information does Google have, or can potentially have, about you? By the way, note that being able to handle this amount of data in a usable way is no easy task, but… still, we can only imagine…

And actually, from a data mining perspective, this is almost a dream situation. Imagine the knowledge collected from all the situations described above: the searches you do, the photos you take, the places you look at in the world, where you go, what images you search for, what you click on, what your interests are, what you blog about, what news you subscribe to, maybe some information on what you work on, and maybe some information about your school, with Google Desktop indexing all your documents… it’s endless: your agenda, your shopping habits, etc., etc…

Soon Google will know more about you than you do. Imagine the ultimate Google application, where you type the following into the search box:

Me: Show me what’s new today!

Google (all the news you are interested in):

Video of the new MacBook Air
Ruby 1.9 released
News on Skype new version for mac

Me: Entertain me!

Google: Do you want to?

play a Joe Satriani cd.
watch Seinfeld episode.
Check out a chocolate cake recipe
Buy a chocolate cake

Me: What is my favorite color, movie, drink?

Google: black, simpsons, coffee!

Google: Hey!!

Me: yes?

Google: Haven’t you forgotten to pay your water bill? And don’t forget that your grandma’s birthday is in 3 days, go buy something nice!

Me: hmm… oops… yeah, thanks

Google: Maybe you want to check out this cookbook (link), 13 recipes contain ingredients your grandma likes… and shipping is free to her address.

Visualizing Data, with Processing and JRuby

2008-01-14T00:00:00+00:00

Here’s a data visualization experiment, including a mini data warehouse, to visualize the number of vegetarians around the world.

I planned the following:

Imagine looking at a world map, with each country showing its number of vegetarians. You should be able to zoom in on Europe, for example, to see which country leads in vegetarian eating, or zoom out and see the whole world picture. You can click on a specific country and see the statistics for that country for a full month, or choose a particular day of the month to visualize the whole world on that day. Map-like navigation is also desired, where it is possible to drag the map around and zoom in and out.

On the technical side, as I was interested in using the Processing framework and because I am a Ruby addict, this turned out to be a good excuse to play with JRuby.

Part 1 - Aggregating Data

Normally this process involves a lot of work, but I had a shortcut: I was able to collect clean data from another database. I’m interested in the table with vegetarian people. But what to collect? What to summarize/aggregate? What to calculate?

NOTE: Specify the goal of the visualization up front as much as possible, as this will influence the whole design. I had to repeat the initial steps a couple of times as the visualization ideas developed, re-aggregating all the data and calculating new fields.

So, in a warehouse fashion, let’s choose the measures and dimensions:

Measures:

number of vegetarians.

Dimensions:

Time.
Localization (country).

Measures are generally numeric data that capture specific values.

Dimensions contain the reference information that gives each transaction its context. When dimensions are created they should be enriched with as much information as possible (including calculated values).

The next step is to build the warehouse. For this I used a plain database where I created 3 tables:

Dimension Country:

Initially I only had the 2-character ISO code identifying the country, but I enriched the dimension with all the other values.

I used the geonames.org web service to collect the other values. Especially important are the geo coordinates of the country’s bounding box, which were used to calculate the central latitude and central longitude of a country, which are going to be used for the visualization.
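The central point calculation can be sketched like this, assuming the bounding box comes as four numbers in degrees (south, north, west, east). The `bbox_center` function and the sample boxes are made up for illustration; the original code may have done it differently, for instance without handling boxes that cross the 180° meridian.

```python
def bbox_center(south, north, west, east):
    """Return (latitude, longitude) of the middle of a bounding box in degrees."""
    lat = (south + north) / 2.0
    if east < west:
        # the box crosses the 180 degree meridian: unwrap it before averaging
        east += 360.0
    lon = (west + east) / 2.0
    if lon > 180.0:
        lon -= 360.0
    return lat, lon

# Illustrative, rounded box (not real country data)
print(bbox_center(south=36.0, north=42.0, west=-10.0, east=-6.0))
```

A plain midpoint is only a rough label position; for countries with distant islands or very irregular shapes it can land outside the mainland.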

Things like continent, population and capital can be used later for summarizing data by continent, for showing the ratio of vegetarians to total population, the number of vegetarians per square meter, etc., etc… think of the possibilities… :)

Dimension Time:

I made a day the finest level of granularity. Then from a day we can calculate day, month, year, day of week, weekday?, day in year, day in month, quarter, weekday name, etc., etc…

What is this useful for? Well, imagine you want to see the number of vegetarians on Wednesdays compared to Mondays, or the same for quarters or months. Maybe getting close to the summer months the number of veggies goes up a bit?
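Deriving those attributes from a single date is mostly standard library work. Here is a minimal sketch of building one row of the time dimension; the `date_dimension_row` function and its field names are assumptions for illustration, not the columns of the original table.

```python
from datetime import date

# fixed list so the names do not depend on the machine's locale
WEEKDAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")

def date_dimension_row(d):
    """Build the derived attributes of one day for a time dimension."""
    return {
        "date": d.isoformat(),
        "day_in_month": d.day,
        "month": d.month,
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "day_of_week": d.isoweekday(),            # 1 = Monday ... 7 = Sunday
        "weekday_name": WEEKDAY_NAMES[d.weekday()],
        "is_weekday": d.isoweekday() <= 5,        # Monday to Friday
        "day_in_year": d.timetuple().tm_yday,
    }

print(date_dimension_row(date(2008, 1, 14)))
```

With the attributes precomputed, a comparison like Wednesdays against Mondays becomes a simple GROUP BY on the weekday column.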

Aggregating

With the basic schema laid out, it’s time for data collection. I used the ActiveRecord part of the Rails framework, running on JRuby. It’s not the first time I’ve used ActiveRecord standalone and I like it a lot… it simplifies data access hugely, and because it’s all inside Ruby, I get the added bonus of doing some calculations that would be much harder in pure SQL. These collected and calculated values are then inserted into a local MySQL database using the schema above: factvegetarian, dimdate and dimcountry.
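To show how the three tables fit together, here is a small sketch of a star schema using Python's sqlite3 module instead of MySQL. Only the table names come from the text; every column name and all the numbers are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimcountry (country_id INTEGER PRIMARY KEY, iso_code TEXT, continent TEXT);
CREATE TABLE dimdate (date_id INTEGER PRIMARY KEY, day TEXT, weekday_name TEXT);
CREATE TABLE factvegetarian (
    country_id INTEGER REFERENCES dimcountry(country_id),
    date_id INTEGER REFERENCES dimdate(date_id),
    vegetarians INTEGER
);
""")
# made-up sample rows
conn.executemany("INSERT INTO dimcountry VALUES (?, ?, ?)",
                 [(1, "PT", "EU"), (2, "US", "NA")])
conn.executemany("INSERT INTO dimdate VALUES (?, ?, ?)",
                 [(1, "2008-01-14", "Monday"), (2, "2008-01-16", "Wednesday")])
conn.executemany("INSERT INTO factvegetarian VALUES (?, ?, ?)",
                 [(1, 1, 10), (2, 1, 30), (1, 2, 12), (2, 2, 25)])

def vegetarians_by(conn, column):
    """Sum the measure grouped by one dimension attribute."""
    # only known column names are allowed, since the name is put into the SQL
    if column not in {"weekday_name", "day", "iso_code", "continent"}:
        raise ValueError("unknown dimension attribute: " + column)
    sql = f"""
        SELECT {column}, SUM(f.vegetarians)
        FROM factvegetarian AS f
        JOIN dimdate AS d ON d.date_id = f.date_id
        JOIN dimcountry AS c ON c.country_id = f.country_id
        GROUP BY {column}
        ORDER BY {column}
    """
    return conn.execute(sql).fetchall()

print(vegetarians_by(conn, "weekday_name"))
print(vegetarians_by(conn, "iso_code"))
```

The fact table holds only the measure plus keys, so any attribute added to a dimension later becomes a new way to slice the same numbers.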

I’ve collected values for a whole month.

This resulted in 225 lines of code for the warehouse part, with some comments… but no repeated code.

Part 2 - Building a Visualizer

The visualizer is a loop that refreshes the interface. On each cycle the database is queried with a set of filters, like view, date and country. The filters are updated when the user clicks on the interface. For example, clicking on US sets the country filter to US; on the next refresh the data for US is fetched and the interface updated accordingly.
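The filter idea can be sketched as a function that turns the current filter state into a query on every refresh. The `build_query` function and the column names (`iso_code`, `day`, and so on) are assumptions for illustration; the original was written in JRuby with ActiveRecord.

```python
def build_query(filters):
    """Turn the current filter state into (sql, params) for the next refresh."""
    clauses, params = [], []
    if filters.get("country"):
        clauses.append("c.iso_code = ?")
        params.append(filters["country"])
    if filters.get("date"):
        clauses.append("d.day = ?")
        params.append(filters["date"])
    sql = ("SELECT c.iso_code, d.day, f.vegetarians "
           "FROM factvegetarian AS f "
           "JOIN dimcountry AS c ON c.country_id = f.country_id "
           "JOIN dimdate AS d ON d.date_id = f.date_id")
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

# a click on a country only changes the state; the next cycle picks it up
filters = {"country": None, "date": None}
filters["country"] = "US"
print(build_query(filters))
```

Keeping the clicks and the drawing decoupled this way means the draw loop never needs to know what was clicked, only what the filters currently say.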

The application was divided into different drawing components:

Show World Data: the opening scenario, showing the whole world over a 1-month period.
Show Country: used for showing the stats of a specific country.
Show Stats: a strip at the bottom showing a graph of the number of vegetarians per day over a 1-month period.
Show Buttons: the buttons used to control zoom, reset, etc…

(A refactoring would probably reduce Show World Data and Show Country into a single drawing component, as they have a lot of repeated code.)

I created a different module for each one, which were then mixed into the main class that inherits from Processing.Sketch.

I made some things clickable:

the country codes displayed on top of the countries, so the user can filter on a single country and see its stats at the bottom. This is done by identifying which country’s coordinates are closest to the mouse coordinates.
the days of the month on the x axis of the stats strip at the bottom, so the user can select a particular day, which updates the world visualization to show the number of vegetarians on that day for the whole world.
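The closest-country lookup for clicks can be sketched as a nearest-point search over the screen positions of the country labels. The `nearest_country` function, the `max_distance` cutoff and the sample positions are assumptions for illustration, not the original JRuby code.

```python
import math

def nearest_country(mouse_x, mouse_y, country_points, max_distance=None):
    """Return the code of the country whose label is closest to the mouse."""
    best_code, best_dist = None, math.inf
    for code, (x, y) in country_points.items():
        dist = math.hypot(x - mouse_x, y - mouse_y)
        if dist < best_dist:
            best_code, best_dist = code, dist
    # optionally ignore clicks that are far away from every label
    if max_distance is not None and best_dist > max_distance:
        return None
    return best_code

# hypothetical screen positions of two country labels, in pixels
points = {"PT": (100, 200), "ES": (120, 200)}
print(nearest_country(104, 198, points))
```

The positions have to be the on-screen ones, recomputed after every zoom or drag, otherwise the click lands on the wrong country.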

And here’s what it looks like:

When zoomed in, and showing Portugal’s stats at the bottom:

I ended up with 584 lines of code on the visualization part, with a big chunk of repeated code.

Overall, making the visualization was a lot more work than the warehouse part, because I had to fight a lot with positioning coordinates correctly, getting a decent map, and keeping the map’s country coordinates in place across zooms.

Using JRuby was mostly a nice experience. There are a couple of things to learn at first, for example how to include Java libraries, no biggie. But I also had a type conversion issue when I tried to refactor the code at some point; I guess it’s because of the Java types that the JRuby guys hide and convert automatically… but most likely it’s because of my inexperience with JRuby…

I used version 1.0 of JRuby. I think the JRuby guys have done great work, making all the millions of Java libraries out there accessible to the Ruby community. But of course don’t expect to write 100% Ruby code like you do with plain old Ruby; sometimes there’s some Java lurking out of the JRuby box.

the Good

Well, it’s very cool to be able to use Ruby for drawing. It gives Ruby power that regular Ruby does not have. There is a huge amount of libraries to use with it. The connection to Java is indeed very powerful.

the Bad

Visualizations are hard. I ended up with a lot of repeated code all over the place. Why? Well, partly because I’m a newbie in JRuby, but partly because Processing seems to be a better fit for small sketch visualizations.

Ideas

Is it possible to do a little architecture around it, to make it a bit better by isolating all the drawing stuff?

Processing

Processing is great and also has huge potential. I had a couple of troubles with 1 or 2 plugins I tried, but I ended up using the base distribution, and that works and feels 100%. I look forward to doing more stuff with it, it is fun!

About

2008-01-01T00:00:00+00:00

Passionate about Data and improving life with Data.

Professional Resume: http://www.linkedin.com/in/al3xandr3

Github Resume: http://resume.github.io/?al3xandr3

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
3	4	.	5	.	6	.	7	1	.	2	.	3	4	.	5
7	1	.	2	.	3	4	.	5	.	6	.	7	1	.	2
5	.	6	.	7	1	.	2	.	3	4	.	5	.	6	.
2	.	3	4	.	5	.	6	.	7	1	.	2	.	3	4
6	.	7	1	.	2	.	3	4	.	5	.	6	.	7	1
3	4	.	5	.	6	.	7	1	.	2	.	3	4	.	5
