<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
>

<channel>
	<title>Ahmet Gyger&#039;s web log</title>
	<atom:link href="http://metah.ch/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://metah.ch/blog</link>
	<description></description>
	<lastBuildDate>Sat, 11 Mar 2023 21:50:32 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.3.20</generator>
<site xmlns="com-wordpress:feed-additions:1">38314948</site>	<item>
		<title>Choosing the right job</title>
		<link>http://metah.ch/blog/2023/03/choosing-the-right-job/</link>
				<comments>http://metah.ch/blog/2023/03/choosing-the-right-job/#respond</comments>
				<pubDate>Sat, 11 Mar 2023 21:50:27 +0000</pubDate>
		<dc:creator><![CDATA[Ahmet]]></dc:creator>
				<category><![CDATA[Career]]></category>
		<category><![CDATA[Decisions]]></category>
		<category><![CDATA[Decisions making]]></category>

		<guid isPermaLink="false">http://metah.ch/blog/?p=1148</guid>
				<description><![CDATA[“I itch to do something new”, this is how a PM I’m mentoring explained why she is considering new opportunities. “I really didn’t like the people I talked with, and the business is not going to grow further” mentioned a good friend of mine about his recent experience interviewing at a company.   Changing jobs… <span class="read-more"><a href="http://metah.ch/blog/2023/03/choosing-the-right-job/">Read More &#187;</a></span>]]></description>
								<content:encoded><![CDATA[
<p>“I itch to do something new” is how a PM I’m mentoring explained why she is considering new opportunities. “I really didn’t like the people I talked with, and the business is not going to grow further,” a good friend of mine said about his recent experience interviewing at a company.  </p>



<figure class="wp-block-gallery columns-1 is-cropped"><ul class="blocks-gallery-grid"><li class="blocks-gallery-item"><figure><img src="http://metah.ch/blog/wp-content/uploads/2023/03/pexels-andrea-piacquadio-3755755-1024x683.jpg" alt="" data-id="1149" data-full-url="http://metah.ch/blog/wp-content/uploads/2023/03/pexels-andrea-piacquadio-3755755-scaled.jpg" data-link="http://metah.ch/blog/?attachment_id=1149" class="wp-image-1149" srcset="http://metah.ch/blog/wp-content/uploads/2023/03/pexels-andrea-piacquadio-3755755-1024x683.jpg 1024w, http://metah.ch/blog/wp-content/uploads/2023/03/pexels-andrea-piacquadio-3755755-300x200.jpg 300w, http://metah.ch/blog/wp-content/uploads/2023/03/pexels-andrea-piacquadio-3755755-768x512.jpg 768w, http://metah.ch/blog/wp-content/uploads/2023/03/pexels-andrea-piacquadio-3755755-1536x1024.jpg 1536w, http://metah.ch/blog/wp-content/uploads/2023/03/pexels-andrea-piacquadio-3755755-2048x1365.jpg 2048w, http://metah.ch/blog/wp-content/uploads/2023/03/pexels-andrea-piacquadio-3755755-660x440.jpg 660w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure></li></ul><figcaption class="blocks-gallery-caption">Making a decision is hard. Photo by <a href="https://www.pexels.com/photo/young-troubled-woman-using-laptop-at-home-3755755/">Andrea Piacquadio</a>.</figcaption></figure>



<p>Changing jobs is a very stressful experience. Unfortunately, many people are now forced to change and choose a new job.  </p>



<p>It is important to have a good framework for deciding which job is the right one for you at the current stage of your career. Many people optimize for the sake of change, or for revenue, rather than for what is best for them. Let me share the framework I have been using for a few years now. It is inspired by decision tree algorithms, which are used in machine learning (and statistical learning in general): they represent all possible outcomes of a decision based on certain conditions, breaking a problem down into smaller, more manageable components by creating a tree-like model of decisions and their possible consequences. I do not intend to go in depth into decision trees, but if you are interested, Harvard Business Review has an excellent article on <a rel="noreferrer noopener" href="https://hbr.org/1964/07/decision-trees-for-decision-making" target="_blank">Decision Trees for Decision-Making</a> (and yes, it is from 1964, so nothing novel here). </p>



<p>The framework, at a high level, is composed of multiple categories, each made up of weighted elements. The categories will be unique to you; they should represent everything that matters to you. Within each category, every element has a weight representing how valuable or important it is to you. For each job opportunity, you then score every element on a scale (I use 0 to 10).  </p>



<p>This is how it looks, conceptually:  </p>



<figure class="wp-block-image size-large"><img src="http://metah.ch/blog/wp-content/uploads/2023/03/DecisionFrameworkOverview-1024x413.png" alt="" class="wp-image-1150" srcset="http://metah.ch/blog/wp-content/uploads/2023/03/DecisionFrameworkOverview-1024x413.png 1024w, http://metah.ch/blog/wp-content/uploads/2023/03/DecisionFrameworkOverview-300x121.png 300w, http://metah.ch/blog/wp-content/uploads/2023/03/DecisionFrameworkOverview-768x309.png 768w, http://metah.ch/blog/wp-content/uploads/2023/03/DecisionFrameworkOverview-1536x619.png 1536w, http://metah.ch/blog/wp-content/uploads/2023/03/DecisionFrameworkOverview-660x266.png 660w, http://metah.ch/blog/wp-content/uploads/2023/03/DecisionFrameworkOverview.png 1782w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>Conceptual view of the Decision Framework</figcaption></figure>



<p>To make this more concrete, here’s a super simple example with values.  </p>



<figure class="wp-block-gallery columns-1 is-cropped"><ul class="blocks-gallery-grid"><li class="blocks-gallery-item"><figure><img src="http://metah.ch/blog/wp-content/uploads/2023/03/DecisionFrameworkExample-1024x399.png" alt="" data-id="1151" data-full-url="http://metah.ch/blog/wp-content/uploads/2023/03/DecisionFrameworkExample.png" data-link="http://metah.ch/blog/?attachment_id=1151" class="wp-image-1151" srcset="http://metah.ch/blog/wp-content/uploads/2023/03/DecisionFrameworkExample-1024x399.png 1024w, http://metah.ch/blog/wp-content/uploads/2023/03/DecisionFrameworkExample-300x117.png 300w, http://metah.ch/blog/wp-content/uploads/2023/03/DecisionFrameworkExample-768x299.png 768w, http://metah.ch/blog/wp-content/uploads/2023/03/DecisionFrameworkExample-1536x599.png 1536w, http://metah.ch/blog/wp-content/uploads/2023/03/DecisionFrameworkExample-660x257.png 660w, http://metah.ch/blog/wp-content/uploads/2023/03/DecisionFrameworkExample.png 1806w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure></li></ul><figcaption class="blocks-gallery-caption">Example of using the Decision Framework</figcaption></figure>



<p>What is important is to find all the categories that matter to you and all the elements that compose each category. Set a weight reflecting your personal values, then find the right score for each element of a job opportunity. &nbsp;</p>



<p>Personally, I have categories like “technology”, “people management”, “location”, “organization”, “customer relationship”, “growth &amp; impact”, “compensation”, “risk”, “work-life harmony”, and “family impact”.   </p>



<p>I try to stick to this template to stay consistent in my scoring across job opportunities: &nbsp;</p>



<p>0 None <br>2 A bit <br>5 Average <br>7 Above average <br>10 A lot </p>
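<p>To make the mechanics concrete, here is a minimal sketch of the weighted scoring in Python. The category names, weights, and scores below are made-up placeholders for illustration, not my actual values:</p>

```python
def score_opportunity(framework, ratings):
    """Weighted total for one job opportunity.

    framework: {category: {element: weight}} -- your personal weights.
    ratings:   {element: 0..10} -- your scores for this opportunity.
    """
    total = 0
    for category, elements in framework.items():
        for element, weight in elements.items():
            total += weight * ratings.get(element, 0)
    return total

# Placeholder categories, weights, and scores (illustrative only).
framework = {
    "compensation": {"base salary": 3, "equity": 2},
    "growth & impact": {"scope": 4, "learning": 5},
}
job_a = {"base salary": 7, "equity": 5, "scope": 10, "learning": 7}
job_b = {"base salary": 10, "equity": 7, "scope": 5, "learning": 2}

print(score_opportunity(framework, job_a))  # 106: growth-heavy role wins here
print(score_opportunity(framework, job_b))  # 74
```

<p>The spreadsheet linked below does the same arithmetic; the point is that the weights encode your values once, so each opportunity only requires scoring the elements.</p>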



<p>Feel free to grab that spreadsheet <a rel="noreferrer noopener" href="https://docs.google.com/spreadsheets/d/1TAjjLqxcwC2gNFxdSyksjoQrBmcS23RJ48J3z5_ijQ8/edit?usp=sharing" target="_blank">here</a> and make it your own! </p>



<p>Good luck in your decision and I hope this will be helpful. &nbsp;</p>
]]></content:encoded>
							<wfw:commentRss>http://metah.ch/blog/2023/03/choosing-the-right-job/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
						<post-id xmlns="com-wordpress:feed-additions:1">1148</post-id>	</item>
		<item>
		<title>Running 1:1s</title>
		<link>http://metah.ch/blog/2022/09/running-11s/</link>
				<comments>http://metah.ch/blog/2022/09/running-11s/#respond</comments>
				<pubDate>Mon, 05 Sep 2022 18:50:45 +0000</pubDate>
		<dc:creator><![CDATA[Ahmet]]></dc:creator>
				<category><![CDATA[Point of View]]></category>
		<category><![CDATA[Learning]]></category>
		<category><![CDATA[People Manager]]></category>

		<guid isPermaLink="false">http://metah.ch/blog/?p=1141</guid>
				<description><![CDATA[It&#8217;s part of a manager&#8217;s role to have weekly 1:1s with the people on your team and your peers. In this post, I&#8217;ll focus only on the 1:1s with people reporting to you. Manager Contract When someone new joins my team or when I join a new team, I&#8217;m asking a set of questions to my… <span class="read-more"><a href="http://metah.ch/blog/2022/09/running-11s/">Read More &#187;</a></span>]]></description>
								<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img src="https://metah.ch/blog/wp-content/uploads/2022/09/pexels-jopwell-2422280-1024x771.jpg" alt="" class="wp-image-1142" srcset="http://metah.ch/blog/wp-content/uploads/2022/09/pexels-jopwell-2422280-1024x771.jpg 1024w, http://metah.ch/blog/wp-content/uploads/2022/09/pexels-jopwell-2422280-300x226.jpg 300w, http://metah.ch/blog/wp-content/uploads/2022/09/pexels-jopwell-2422280-768x578.jpg 768w, http://metah.ch/blog/wp-content/uploads/2022/09/pexels-jopwell-2422280-1536x1157.jpg 1536w, http://metah.ch/blog/wp-content/uploads/2022/09/pexels-jopwell-2422280-2048x1542.jpg 2048w, http://metah.ch/blog/wp-content/uploads/2022/09/pexels-jopwell-2422280-660x497.jpg 660w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>Two persons discussing during a 1:1. Photo by <a href="https://www.pexels.com/photo/woman-wearing-teal-dress-sitting-on-chair-talking-to-man-2422280/">Jopwell</a>.</figcaption></figure>



<p>It&#8217;s part of a manager&#8217;s role to have weekly 1:1s with the people on your team and with your peers. In this post, I&#8217;ll focus only on the 1:1s with people reporting to you. </p>



<h2>Manager Contract</h2>



<p>When someone new joins my team, or when I join a new team, I ask my reports a set of questions meant to help me understand how best to support them and how to build a strong relationship. </p>



<p><strong>Contract</strong></p>



<p>Here&#8217;s the list of questions I ask people on my team to answer at our first official 1:1.</p>



<ul><li>Which areas would you like the most support with?</li><li>How would you like to receive feedback and support? </li><li>What could be a challenge for us working together? </li><li>How might we know if the support I&#8217;m offering is not going well? </li><li>How confidential is our meeting? </li><li>What are the qualities of a perfect manager for you?</li><li>What are all the projects you are working on? </li><li>Where are you focusing your growth?  </li></ul>



<p>Each of these questions aims to identify key aspects of the relationship you are building with your team.</p>



<p><strong>Which areas would you like the most support with? </strong><br>With this question you, as a manager, can identify whether there are organizational issues (people asking for support dealing with other people, teams, or processes) and whether your report knows their weaknesses and areas for growth. </p>



<p><strong>How would you like to receive feedback and support? </strong><br>Most of the time I hear that people want constant feedback, but here the <em>how</em> is important. Is it during our 1:1s, or directly as you see something happening? As a manager, the feedback you provide your team is important for them to know 1) they are on the right track, 2) they are making progress, and 3) you have their back and care about them. </p>



<p><strong>What could be a challenge for us working together? </strong><br>This question gives your report an opportunity to address a potential challenge in your relationship. Maybe you were peers and you got promoted instead of them? Maybe you are in a different time zone? Maybe you come from significantly different cultures? <br>It can also be a moment where your report expresses what they expect from a manager (don&#8217;t micro-manage, don&#8217;t add more load than help, &#8230;)</p>



<p><strong>How might we know if the support I&#8217;m offering is not going well? </strong><br>It&#8217;s critical to have a mechanism through which, as a manager, you can validate that you are actually helping your team. It&#8217;s important to get feedback from your team so you can grow as a manager.  </p>



<p><strong>How confidential is our meeting? </strong><br>As a manager, it&#8217;s important that you can act on some of the conversations you have during your 1:1s. You might figure out that someone is at risk of attrition, or that someone is not happy in their current role. I believe it&#8217;s important to be transparent with the team about this and set the right expectations. It&#8217;s also important that the team knows they can speak with you without everything bubbling up. Being explicit during the conversation and asking how confidential they want to keep it is a good mechanism for building trust. </p>



<p><strong>What are the qualities of a perfect manager for you?</strong><br>This is one question where you can learn what your report values in a manager. Is it all about the career, is it about giving autonomy, or is it about building a sense of belonging? <br>You can learn a lot from their past examples of perfect and poor managers as well. </p>



<p><strong>What are all the projects you are working on?</strong><br>This question is about learning what your report is working on&#8230; you might be surprised to learn they are working on more things than you expected. It&#8217;s a great foundation for learning more about all of these items. </p>



<p><strong>Where are you focusing your growth? </strong><br>I&#8217;m a huge fan of the <a href="https://metah.ch/blog/2015/08/mindset-the-new-psychology-of-success-book-notes/">growth mindset</a>; having clear objectives for growth can help build a better team and happier people. When someone doesn&#8217;t have any focus on growth, it might be an indicator that some coaching on growth mindset could help, or that this person is so busy that they can&#8217;t focus on growth. </p>



<p><strong>Resources: </strong><a href="https://read.amazon.com/kp/embed?asin=B08GF7P3G8&amp;preview=newtab&amp;linkCode=kpe&amp;ref_=cm_sw_r_kb_dp_AZPHDVQKG3HCCWYHY8K3&amp;tag=metah-20">Become an effective software engineering manager (book)</a><br> </p>



<h2>Beyond the manager contract</h2>



<p>Once you have agreement on the manager contract, you can run your weekly 1:1s. <br>This is where you can 1) learn insights from your team, 2) identify obstacles, and 3) coach. </p>



<h3>Insights from your team</h3>



<p>As a manager, you want to learn all the insights your report has gathered during the week (see the <a href="https://metah.ch/blog/2022/09/product-management-101/">product strategy management</a> post if you are wondering why). Sources of insights can be quantitative, qualitative, technology, and industry. As a manager, it&#8217;s your role to disseminate all the insights to the right people in the organization. </p>



<h3>Identify obstacles</h3>



<p>There are many different obstacles your team can meet where, as a manager, you can provide coaching and support.</p>



<ul><li>Dependency on another team</li><li>Need to acquire a new technology</li><li>Customer issues</li><li>Single choke point (platform team?) </li><li>Senior stakeholder raises concern </li></ul>



<h3>Coaching</h3>



<p>I strongly recommend this book from Michael Bungay Stanier to learn how to coach: <a href="https://read.amazon.com/kp/embed?asin=B01BUIBBZI&amp;preview=newtab&amp;linkCode=kpe&amp;ref_=cm_sw_r_kb_dp_EFXYEFM31ZTF1JKX2CYS&amp;tag=metah-20">The Coaching Habit &#8211; Say less, ask more &amp; change the way you lead forever</a> <br>The crux of it can be summarized in 7 questions: </p>



<ol><li><strong>What&#8217;s on your mind?</strong> <br>Stay focused and open.</li><li><strong>And what else?</strong> <br>Helps draw out more answers and boosts the following questions.</li><li><strong>What&#8217;s the real challenge here for you?</strong><br>Begins to funnel the topic in a way that focuses the conversation.</li><li><strong>What do you want? </strong><br>It&#8217;s the heart of the matter, the foundation question. </li><li><strong>How can I help?</strong><br>We learn what our role should be here.</li><li><strong>If you are saying yes to this, what are you saying no to?</strong><br>Develop the strength of staying curious before committing. </li><li><strong>What was the most useful for you?</strong><br>Learn what was valuable from the coached person. </li></ol>



<h2>Keep it written</h2>



<p>For all my 1:1s (also with my manager) I keep a document that I add to every week, so I can be reminded of our conversations over time and we can measure our progress as well. It&#8217;s also a good mechanism so people can add questions to the document instead of sending you another email or IM.</p>



<p>Anything else that is helpful for your 1:1s? </p>
]]></content:encoded>
							<wfw:commentRss>http://metah.ch/blog/2022/09/running-11s/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
						<post-id xmlns="com-wordpress:feed-additions:1">1141</post-id>	</item>
		<item>
		<title>Product Management 101</title>
		<link>http://metah.ch/blog/2022/09/product-management-101/</link>
				<comments>http://metah.ch/blog/2022/09/product-management-101/#comments</comments>
				<pubDate>Mon, 05 Sep 2022 17:37:59 +0000</pubDate>
		<dc:creator><![CDATA[Ahmet]]></dc:creator>
				<category><![CDATA[Point of View]]></category>

		<guid isPermaLink="false">http://metah.ch/blog/?p=1140</guid>
				<description><![CDATA[The intention for this page is to capture bits and pieces of knowledge on product management&#8230; Product strategy How do we make the product vision a reality, while meeting the needs of the company as we go. It requires choice, thinking, and effort. Once the product strategy is defined, goals and roadmap need to be… <span class="read-more"><a href="http://metah.ch/blog/2022/09/product-management-101/">Read More &#187;</a></span>]]></description>
								<content:encoded><![CDATA[
<p>The intention for this page is to capture bits and pieces of knowledge on product management&#8230; </p>



<h2>Product strategy</h2>



<p>How do we make the product vision a reality, while meeting the needs of the company as we go? It requires choices, thinking, and effort. Once the product strategy is defined, goals and roadmap need to be aligned. </p>



<p>The biggest challenge with product strategy is to ensure that we are making choices about what is really important. These choices need to be informed by generated insights and help the organization focus and set priorities. </p>



<p>It boils down to 1) focus the organization on a small number of truly important problems, 2) identify key insights, 3) convert the insights into actions in the form of objectives, 4) manage the teams for success (remove obstacles).</p>



<p><strong>Resources:</strong><br><a href="https://www.svpg.com/product-strategy-overview/">SVPG &#8211; product strategy overview</a></p>



<h3>Insights</h3>



<p>Four valuable sources of insights: quantitative, qualitative, technology, and industry.</p>



<ul><li><strong>Quantitative</strong>: analysis of product data like the business model, acquisition funnel, customer retention, etc. Being able to run live-data tests (A/B testing) is an important skill for an organization.</li><li><strong>Qualitative</strong>: comes from user research and can be very profound even if not statistically significant. Insights can be either evaluative (testing out new ideas) or generative (discovering new opportunities).</li><li><strong>Technology: </strong>new technology can bring a new perspective to solve long-standing problems. Monitoring the technology space is important to catch these insights. </li><li><strong>Industry:</strong> the major trends in your industry, plus insights from other industries that may pertain to yours. </li></ul>



<p>All generated insights need to be shared and communicated. This is where leaders become important, as a way to aggregate and distribute the insights to the right people. The most important insights can be summarized and shared with the broader organization.</p>



<p><strong>Resources:</strong><br><a href="https://www.svpg.com/product-strategy-insights/">SVPG &#8211; Product Strategy Insights</a></p>



<h3>Actions</h3>



<p>There are two approaches to setting actions: you can either have a team of mercenaries (executing on your orders) or a team of missionaries (executing on a vision). The former is efficient when you need to add features to a roadmap, while the latter is generally better at building the right product. </p>



<p>OKRs (Objectives and Key Results) are the most popular system. </p>



<ul><li>Objective: the customer or business problem we need to solve. </li><li>Key results: how we measure progress.</li></ul>



<p><strong>Resources:</strong><br><a href="https://www.svpg.com/product-strategy-actions/">SVPG &#8211; Product Strategy Actions</a></p>



<h3>Management</h3>



<p>A number of issues and obstacles will emerge; management will need to provide assistance to unblock them and make sure the product team keeps making progress towards its objectives.</p>



<p><strong>Resources:</strong><br><a href="https://www.svpg.com/product-strategy-management/">SVPG &#8211; Product Strategy Management</a></p>



]]></content:encoded>
							<wfw:commentRss>http://metah.ch/blog/2022/09/product-management-101/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
						<post-id xmlns="com-wordpress:feed-additions:1">1140</post-id>	</item>
		<item>
		<title>Extracting action and meeting intents from communication</title>
		<link>http://metah.ch/blog/2020/12/extracting-action-and-meeting-intents-from-communication/</link>
				<comments>http://metah.ch/blog/2020/12/extracting-action-and-meeting-intents-from-communication/#respond</comments>
				<pubDate>Sun, 27 Dec 2020 03:07:25 +0000</pubDate>
		<dc:creator><![CDATA[Ahmet]]></dc:creator>
				<category><![CDATA[Conversational AI]]></category>
		<category><![CDATA[Intelligence]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[Language model]]></category>
		<category><![CDATA[Unified Productivity]]></category>

		<guid isPermaLink="false">http://metah.ch/blog/?p=1133</guid>
				<description><![CDATA[I was recently able to make good progress on a side project that I&#8217;ve been working on for some time. The idea is to have a &#8216;virtual assistant&#8217; that will find action and meeting intents from the different channels of communication I&#8217;m using daily, for work or personally. Being always remote means that the digital… <span class="read-more"><a href="http://metah.ch/blog/2020/12/extracting-action-and-meeting-intents-from-communication/">Read More &#187;</a></span>]]></description>
								<content:encoded><![CDATA[
<p>I was recently able to make good progress on a side project that I&#8217;ve been working on for some time. The idea is to have a &#8216;virtual assistant&#8217; that will find action and meeting intents from the different channels of communication I&#8217;m using daily, for work or personally. </p>



<p>Being always remote means that digital communication has increased exponentially, and it can be overwhelming to keep tabs on everything that is going on. The virtual assistant will hopefully fill the gaps I&#8217;ve left in my emails, chats, and comments.</p>



<p>The idea is to use a language model to identify intents from communication; it appears to be working pretty well so far, even without any fine-tuning. </p>



<p>Below are some examples I captured from the demo I&#8217;ve put together. </p>



<figure class="wp-block-image size-large"><img src="http://metah.ch/blog/wp-content/uploads/2020/12/image-1024x441.png" alt="" class="wp-image-1134" srcset="http://metah.ch/blog/wp-content/uploads/2020/12/image-1024x441.png 1024w, http://metah.ch/blog/wp-content/uploads/2020/12/image-300x129.png 300w, http://metah.ch/blog/wp-content/uploads/2020/12/image-768x331.png 768w, http://metah.ch/blog/wp-content/uploads/2020/12/image-1536x662.png 1536w, http://metah.ch/blog/wp-content/uploads/2020/12/image-660x284.png 660w, http://metah.ch/blog/wp-content/uploads/2020/12/image.png 1648w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<figure class="wp-block-image size-large"><img src="http://metah.ch/blog/wp-content/uploads/2020/12/image-1-1024x566.png" alt="" class="wp-image-1135" srcset="http://metah.ch/blog/wp-content/uploads/2020/12/image-1-1024x566.png 1024w, http://metah.ch/blog/wp-content/uploads/2020/12/image-1-300x166.png 300w, http://metah.ch/blog/wp-content/uploads/2020/12/image-1-768x424.png 768w, http://metah.ch/blog/wp-content/uploads/2020/12/image-1-1536x848.png 1536w, http://metah.ch/blog/wp-content/uploads/2020/12/image-1-660x365.png 660w, http://metah.ch/blog/wp-content/uploads/2020/12/image-1.png 1626w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<figure class="wp-block-image size-large"><img src="http://metah.ch/blog/wp-content/uploads/2020/12/image-2-1024x585.png" alt="" class="wp-image-1136" srcset="http://metah.ch/blog/wp-content/uploads/2020/12/image-2-1024x585.png 1024w, http://metah.ch/blog/wp-content/uploads/2020/12/image-2-300x171.png 300w, http://metah.ch/blog/wp-content/uploads/2020/12/image-2-768x439.png 768w, http://metah.ch/blog/wp-content/uploads/2020/12/image-2-1536x877.png 1536w, http://metah.ch/blog/wp-content/uploads/2020/12/image-2-660x377.png 660w, http://metah.ch/blog/wp-content/uploads/2020/12/image-2.png 1646w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>As you can see in the examples above, the tool highlights the sentences that match one of the defined intents (action, meeting, or task completion) and displays all the sentences with a confidence score above 2. </p>
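<p>Conceptually, the highlighting step boils down to the sketch below. The scorer here is a trivial keyword stand-in purely for illustration; the actual tool relies on a language model, and the cue words and 2.5 values are made up:</p>

```python
# Sketch of the threshold-filtering step. `score_intents` is a hypothetical
# keyword-based stand-in for the language model used in the real project.
INTENTS = ("action", "meeting", "task completion")
THRESHOLD = 2.0  # only sentences scoring above this are highlighted

def score_intents(sentence):
    # Illustrative scorer: 2.5 points per matching cue word per intent.
    cues = {
        "action": ("please", "could you", "send"),
        "meeting": ("meet", "schedule", "call"),
        "task completion": ("done", "finished", "completed"),
    }
    low = sentence.lower()
    return {intent: sum(2.5 for cue in cues[intent] if cue in low)
            for intent in INTENTS}

def highlight(sentences):
    """Return (sentence, intent, score) tuples with scores above THRESHOLD."""
    results = []
    for s in sentences:
        for intent, score in score_intents(s).items():
            if score > THRESHOLD:
                results.append((s, intent, score))
    return results

msgs = ["Could you send the report by Friday?",
        "Let's schedule a call next week."]
for s, intent, score in highlight(msgs):
    print(f"{intent}: {s} ({score})")
```

<p>The real system applies the same keep-or-discard logic per sentence, but with scores inferred by the language model instead of keyword counts.</p>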



<p>If you are interested in trying it yourself, I have a page up and running in my lab: <a href="http://metah.ch/up/home.html">http://metah.ch/up/home.html</a>. I&#8217;d love to get your feedback, and to hear if you notice anything that is not correctly inferred. </p>



<h3>Next step </h3>



<p>Now that I have the end-to-end flow working, my goal is to enable the functionality directly in an email program (like Gmail or Outlook) using add-ons. </p>



<p>If you would like to stay informed about the progress, please share your details below: </p>


<script>(function() {
	window.mc4wp = window.mc4wp || {
		listeners: [],
		forms: {
			on: function(evt, cb) {
				window.mc4wp.listeners.push(
					{
						event   : evt,
						callback: cb
					}
				);
			}
		}
	}
})();
</script><!-- Mailchimp for WordPress v4.8.1 - https://wordpress.org/plugins/mailchimp-for-wp/ --><form id="mc4wp-form-1" class="mc4wp-form mc4wp-form-1137" method="post" data-id="1137" data-name="UP" ><div class="mc4wp-form-fields"><p>
    <label>First Name</label>
    <input type="text" name="FNAME" required="">
</p>
<p>
	<label>Email address: 
		<input type="email" name="EMAIL" placeholder="Your email address" required />
</label>
</p>

<p>
	<input type="submit" value="Sign up" />
</p></div><label style="display: none !important;">Leave this field empty if you're human: <input type="text" name="_mc4wp_honeypot" value="" tabindex="-1" autocomplete="off" /></label><input type="hidden" name="_mc4wp_timestamp" value="1772283562" /><input type="hidden" name="_mc4wp_form_id" value="1137" /><input type="hidden" name="_mc4wp_form_element_id" value="mc4wp-form-1" /><div class="mc4wp-response"></div></form><!-- / Mailchimp for WordPress Plugin -->]]></content:encoded>
							<wfw:commentRss>http://metah.ch/blog/2020/12/extracting-action-and-meeting-intents-from-communication/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
						<post-id xmlns="com-wordpress:feed-additions:1">1133</post-id>	</item>
		<item>
		<title>May 2020: Conversational AI &#8211; Research Papers</title>
		<link>http://metah.ch/blog/2020/05/may-2020-conversational-ai-research-papers/</link>
				<comments>http://metah.ch/blog/2020/05/may-2020-conversational-ai-research-papers/#respond</comments>
				<pubDate>Mon, 11 May 2020 04:59:03 +0000</pubDate>
		<dc:creator><![CDATA[Ahmet]]></dc:creator>
				<category><![CDATA[Conversational AI]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing]]></category>

		<guid isPermaLink="false">http://metah.ch/blog/?p=1132</guid>
				<description><![CDATA[I hope to keep this page up to date with the latest published papers that I found interesting on the topic of conversational AI, natural language processing (NLP), and knowledge extraction. This is only for the month of May, here&#8217;s the whole list on Conversational AI research papers. Fact-based Dialogue Generation with Convergent and Divergent… <span class="read-more"><a href="http://metah.ch/blog/2020/05/may-2020-conversational-ai-research-papers/">Read More &#187;</a></span>]]></description>
								<content:encoded><![CDATA[
<p>I hope to keep this page up to date with the latest published papers that I find interesting on the topics of conversational AI, natural language processing (NLP), and knowledge extraction. This covers only the month of May; here&#8217;s the whole list of <a href="http://metah.ch/blog/2019/12/conversational-ai-research-papers/">Conversational AI research papers</a>.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.03174">Fact-based Dialogue Generation with Convergent and Divergent Decoding</a></p>



<p>Fact-based dialogue generation is a task of generating a human-like response based on both dialogue context and factual texts. Various methods were proposed to focus on generating informative words that contain facts effectively. However, previous works implicitly assume a topic to be kept on a dialogue and usually converse passively, therefore the systems have a difficulty to generate diverse responses that provide meaningful information proactively. This paper proposes an end-to-end fact-based dialogue system augmented with the ability of convergent and divergent thinking over both context and facts, which can converse about the current topic or introduce a new topic. Specifically, our model incorporates a novel convergent and divergent decoding that can generate informative and diverse responses considering not only given inputs (context and facts) but also inputs-related topics. Both automatic and human evaluation results on DSTC7 dataset show that our model significantly outperforms state-of-the-art baselines, indicating that our model can generate more appropriate, informative, and diverse responses.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1910.14599">Adversarial NLI: A New Benchmark for Natural Language Understanding</a></p>



<p>We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks, while posing a more difficult challenge with its new test set. Our analysis sheds light on the shortcomings of current state-of-the-art models, and shows that non-expert annotators are successful at finding their weaknesses. The data collection method can be applied in a never-ending learning scenario, becoming a moving target for NLU, rather than a static benchmark that will quickly saturate.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.02843">Graph-Embedding Empowered Entity Retrieval</a></p>



<p>In this research, we improve upon the current state of the art in entity retrieval by re-ranking the result list using graph embeddings. The paper shows that graph embeddings are useful for entity-oriented search tasks. We demonstrate empirically that encoding information from the knowledge graph into (graph) embeddings yields a greater improvement in the effectiveness of entity retrieval than using plain word embeddings. We analyze the impact of the accuracy of the entity linker on overall retrieval effectiveness. Our analysis further deploys the cluster hypothesis to explain the observed advantages of graph embeddings over the more widely used word embeddings for user tasks involving ranking entities.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1911.03906">Efficient Dialogue State Tracking by Selectively Overwriting Memory</a></p>



<p>Recent works in dialogue state tracking (DST) focus on an open-vocabulary setting to resolve the scalability and generalization issues of predefined ontology-based approaches. However, they are inefficient in that they predict the dialogue state from scratch at every turn. Here, we treat the dialogue state as an explicit fixed-size memory and propose a selectively overwriting mechanism for more efficient DST. This mechanism consists of two steps: (1) predicting a state operation for each of the memory slots, and (2) overwriting the memory with new values, of which only a few are generated according to the predicted state operations. Our method decomposes DST into two sub-tasks and guides the decoder to focus on only one of them, thus reducing its burden. This enhances the effectiveness of training and DST performance. Our SOM-DST (Selectively Overwriting Memory for Dialogue State Tracking) model achieves state-of-the-art joint goal accuracy of 51.72% on MultiWOZ 2.0 and 53.01% on MultiWOZ 2.1 in the open-vocabulary DST setting. In addition, we analyze the accuracy gap between the current setting and the one where the ground truth is given, and suggest that improving state operation prediction is a promising direction for boosting DST performance.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.00689">An Imitation Game for Learning Semantic Parsers from User Interaction</a></p>



<p>Despite widely successful applications, bootstrapping and fine-tuning semantic parsers remains a tedious process, with challenges such as costly data annotation and privacy risks. In this paper, we suggest an alternative, human-in-the-loop methodology for learning semantic parsers directly from users. A semantic parser should be introspective about its uncertainties and prompt for user demonstration when uncertain. In doing so, it also gets to imitate the user's behavior and continue improving itself autonomously, with the hope that it may eventually become as good as the user at interpreting their questions. To combat the sparsity of demonstrations, we propose a novel annotation-efficient imitation learning algorithm, which iteratively collects new datasets by mixing demonstrated states and confident predictions and re-trains the semantic parser in a Dataset Aggregation fashion (Ross et al., 2011). We provide a theoretical analysis of its cost bound and also empirically demonstrate its promising performance on the text-to-SQL problem.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.00683">Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models</a></p>



<p>Recent works show that pre-trained masked language models, such as BERT, possess certain linguistic and commonsense knowledge. However, it remains to be seen what types of commonsense knowledge these models have access to. In this vein, we propose to study whether numerical commonsense knowledge &#8212; commonsense knowledge that provides an understanding of the numeric relations between entities &#8212; can be induced from pre-trained masked language models, and to what extent this knowledge is robust against adversarial examples. To study this, we introduce a probing task with a diagnostic dataset, NumerSense, containing 3,145 masked-word-prediction probes. Surprisingly, our experiments and analysis reveal that: (1) BERT and its stronger variant RoBERTa perform poorly on our dataset prior to any fine-tuning; (2) fine-tuning with distant supervision does improve performance; and (3) the best distantly supervised model still performs poorly compared to humans (47.8% vs. 96.3%).</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.00856">SEEK: Segmented Embedding of Knowledge Graphs</a></p>



<p>In recent years, knowledge graph embedding has become a popular research topic in artificial intelligence and plays an increasingly vital role in various downstream applications, such as recommendation and question answering. However, existing methods for knowledge graph embedding cannot make a proper trade-off between model complexity and model expressiveness, which leaves them far from satisfactory. To mitigate this problem, we propose a lightweight modeling framework that can achieve highly competitive relational expressiveness without increasing model complexity. Our framework focuses on the design of scoring functions and highlights two critical characteristics: 1) facilitating sufficient feature interactions; 2) preserving both the symmetry and antisymmetry properties of relations. Notably, owing to the general and elegant design of its scoring functions, our framework can incorporate many well-known existing methods as special cases. Moreover, extensive experiments on public benchmarks demonstrate the efficiency and effectiveness of our framework.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.00870">Predicting Performance for Natural Language Processing Tasks</a></p>



<p>Given the complexity of combinations of tasks, languages, and domains in natural language processing (NLP) research, it is computationally prohibitive to exhaustively test newly proposed models on each possible experimental setting. In this work, we attempt to explore the possibility of gaining plausible judgments of how well an NLP model can perform under an experimental setting, without actually training or testing the model. To do so, we build regression models to predict the evaluation score of an NLP experiment given the experimental settings as input. Experimenting on 9 different NLP tasks, we find that our predictors can produce meaningful predictions over unseen languages and different modeling architectures, outperforming reasonable baselines as well as human experts. Going further, we outline how our predictor can be used to find a small subset of representative experiments that should be run in order to obtain plausible predictions for all other experimental settings.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1911.00536">DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation</a></p>



<p>We present a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained transformer). Trained on 147M conversation-like exchanges extracted from Reddit comment chains spanning 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain performance close to human level in both automatic and human evaluation in single-turn dialogue settings. We show that conversational systems that leverage DialoGPT generate more relevant, contentful, and context-consistent responses than strong baseline systems. The pre-trained model and training pipeline are publicly released to facilitate research into neural response generation and the development of more intelligent open-domain dialogue systems.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.00316">Self-supervised Knowledge Triplet Learning for Zero-shot Question Answering</a></p>



<p>The aim of all Question Answering (QA) systems is to generalize to unseen questions. Most current methods rely on learning every possible scenario, which depends on expensive data annotation. Moreover, such annotations can introduce unintended bias, which makes systems focus more on the bias than on the actual task. In this work, we propose Knowledge Triplet Learning, a self-supervised task over knowledge graphs. We describe how to use such a model to perform zero-shot QA, and our experiments show considerable improvements over large pre-trained generative models.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.00171">Cross-lingual Entity Alignment for Knowledge Graphs with Incidental Supervision from Free Text</a></p>



<p>Much research effort has been put into multilingual knowledge graph (KG) embedding methods to address the entity alignment task, which seeks to match entities in different language-specific KGs that refer to the same real-world object. Such methods are often hindered by the insufficiency of seed alignments provided between KGs. Therefore, we propose a new model, JEANS, which jointly represents multilingual KGs and text corpora in a shared embedding scheme and seeks to improve entity alignment with incidental supervision signals from text. JEANS first deploys an entity grounding process to combine each KG with its monolingual text corpus. Then, two learning processes are conducted: (i) an embedding learning process to encode the KG and text of each language in one embedding space, and (ii) a self-learning-based alignment learning process to iteratively induce the correspondence of entities and of lexemes between embeddings. Experiments on benchmark datasets show that JEANS leads to promising improvements in entity alignment with incidental supervision and significantly outperforms state-of-the-art methods that rely solely on the internal information of KGs.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1911.02896">Contextualized Sparse Representations for Real-Time Open-Domain Question Answering</a></p>



<p>Open-domain question answering can be formulated as a phrase retrieval problem, which promises huge scalability and speed benefits but often suffers from low accuracy due to the limitations of existing phrase representation models. In this paper, we aim to improve the quality of each phrase embedding by augmenting it with a contextualized sparse representation (Sparc). Unlike previous sparse vectors that are term-frequency-based (e.g., tf-idf) or directly learned (only a few thousand dimensions), we leverage rectified self-attention to indirectly learn sparse vectors in an n-gram vocabulary space. By augmenting the previous phrase retrieval model (Seo et al., 2019) with Sparc, we show a 4%+ improvement on CuratedTREC and SQuAD-Open. Our CuratedTREC score is even better than that of the best known retrieve &amp; read model, with at least 45x faster inference speed.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2004.02105">Unsupervised Domain Clusters in Pretrained Language Models</a></p>



<p>The notion of &#8220;in-domain data&#8221; in NLP is often over-simplistic and vague, as textual data varies in many nuanced linguistic aspects such as topic, style, or level of formality. In addition, domain labels are often unavailable, making it challenging to build domain-specific systems. We show that massive pre-trained language models implicitly learn sentence representations that cluster by domain without supervision &#8212; suggesting a simple data-driven definition of domains in textual data. We harness this property and propose domain data selection methods based on such models, which require only a small set of in-domain monolingual data. We evaluate our data selection methods for neural machine translation across five diverse domains, where they outperform an established approach as measured both by BLEU and by precision and recall of sentence selection with respect to an oracle.</p>
]]></content:encoded>
							<wfw:commentRss>http://metah.ch/blog/2020/05/may-2020-conversational-ai-research-papers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
						<post-id xmlns="com-wordpress:feed-additions:1">1132</post-id>	</item>
		<item>
		<title>Conversational AI &#8211; Research Papers</title>
		<link>http://metah.ch/blog/2019/12/conversational-ai-research-papers/</link>
				<comments>http://metah.ch/blog/2019/12/conversational-ai-research-papers/#comments</comments>
				<pubDate>Fri, 13 Dec 2019 05:57:49 +0000</pubDate>
		<dc:creator><![CDATA[Ahmet]]></dc:creator>
				<category><![CDATA[Conversational AI]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[ConversationalAI]]></category>

		<guid isPermaLink="false">http://metah.ch/blog/?p=1123</guid>
				<description><![CDATA[I hope to keep this page up to date with the latest published papers that I found interesting on the topic of conversational AI, natural language processing (NLP), and knowledge extraction. May 2020 Fact-based Dialogue Generation with Convergent and Divergent Decoding Fact-based dialogue generation is a task of generating a human-like response based on both… <span class="read-more"><a href="http://metah.ch/blog/2019/12/conversational-ai-research-papers/">Read More &#187;</a></span>]]></description>
								<content:encoded><![CDATA[
<p>I hope to keep this page up to date with the latest published papers that I found interesting on the topic of conversational AI, natural language processing (NLP), and knowledge extraction.   </p>



<h2>May 2020</h2>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.03174">Fact-based Dialogue Generation with Convergent and Divergent Decoding</a></p>



<p>Fact-based dialogue generation is the task of generating a human-like response based on both dialogue context and factual texts. Various methods have been proposed to generate informative, fact-bearing words effectively. However, previous works implicitly assume that a single topic is maintained throughout a dialogue and tend to converse passively, so the systems have difficulty generating diverse responses that proactively provide meaningful information. This paper proposes an end-to-end fact-based dialogue system augmented with the ability of convergent and divergent thinking over both context and facts, which can converse about the current topic or introduce a new topic. Specifically, our model incorporates a novel convergent and divergent decoding that can generate informative and diverse responses considering not only the given inputs (context and facts) but also input-related topics. Both automatic and human evaluation results on the DSTC7 dataset show that our model significantly outperforms state-of-the-art baselines, indicating that it can generate more appropriate, informative, and diverse responses.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1910.14599">Adversarial NLI: A New Benchmark for Natural Language Understanding</a></p>



<p>We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks, while posing a more difficult challenge with its new test set. Our analysis sheds light on the shortcomings of current state-of-the-art models, and shows that non-expert annotators are successful at finding their weaknesses. The data collection method can be applied in a never-ending learning scenario, becoming a moving target for NLU, rather than a static benchmark that will quickly saturate.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.02843">Graph-Embedding Empowered Entity Retrieval</a></p>



<p>In this research, we improve upon the current state of the art in entity retrieval by re-ranking the result list using graph embeddings. The paper shows that graph embeddings are useful for entity-oriented search tasks. We demonstrate empirically that encoding information from the knowledge graph into (graph) embeddings yields a greater improvement in the effectiveness of entity retrieval than using plain word embeddings. We analyze the impact of the accuracy of the entity linker on overall retrieval effectiveness. Our analysis further deploys the cluster hypothesis to explain the observed advantages of graph embeddings over the more widely used word embeddings for user tasks involving ranking entities.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1911.03906">Efficient Dialogue State Tracking by Selectively Overwriting Memory</a></p>



<p>Recent works in dialogue state tracking (DST) focus on an open-vocabulary setting to resolve the scalability and generalization issues of predefined ontology-based approaches. However, they are inefficient in that they predict the dialogue state from scratch at every turn. Here, we treat the dialogue state as an explicit fixed-size memory and propose a selectively overwriting mechanism for more efficient DST. This mechanism consists of two steps: (1) predicting a state operation for each of the memory slots, and (2) overwriting the memory with new values, of which only a few are generated according to the predicted state operations. Our method decomposes DST into two sub-tasks and guides the decoder to focus on only one of them, thus reducing its burden. This enhances the effectiveness of training and DST performance. Our SOM-DST (Selectively Overwriting Memory for Dialogue State Tracking) model achieves state-of-the-art joint goal accuracy of 51.72% on MultiWOZ 2.0 and 53.01% on MultiWOZ 2.1 in the open-vocabulary DST setting. In addition, we analyze the accuracy gap between the current setting and the one where the ground truth is given, and suggest that improving state operation prediction is a promising direction for boosting DST performance.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.00689">An Imitation Game for Learning Semantic Parsers from User Interaction</a></p>



<p>Despite widely successful applications, bootstrapping and fine-tuning semantic parsers remains a tedious process, with challenges such as costly data annotation and privacy risks. In this paper, we suggest an alternative, human-in-the-loop methodology for learning semantic parsers directly from users. A semantic parser should be introspective about its uncertainties and prompt for user demonstration when uncertain. In doing so, it also gets to imitate the user's behavior and continue improving itself autonomously, with the hope that it may eventually become as good as the user at interpreting their questions. To combat the sparsity of demonstrations, we propose a novel annotation-efficient imitation learning algorithm, which iteratively collects new datasets by mixing demonstrated states and confident predictions and re-trains the semantic parser in a Dataset Aggregation fashion (Ross et al., 2011). We provide a theoretical analysis of its cost bound and also empirically demonstrate its promising performance on the text-to-SQL problem.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.00683">Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models</a></p>



<p>Recent works show that pre-trained masked language models, such as BERT, possess certain linguistic and commonsense knowledge. However, it remains to be seen what types of commonsense knowledge these models have access to. In this vein, we propose to study whether numerical commonsense knowledge &#8212; commonsense knowledge that provides an understanding of the numeric relations between entities &#8212; can be induced from pre-trained masked language models, and to what extent this knowledge is robust against adversarial examples. To study this, we introduce a probing task with a diagnostic dataset, NumerSense, containing 3,145 masked-word-prediction probes. Surprisingly, our experiments and analysis reveal that: (1) BERT and its stronger variant RoBERTa perform poorly on our dataset prior to any fine-tuning; (2) fine-tuning with distant supervision does improve performance; and (3) the best distantly supervised model still performs poorly compared to humans (47.8% vs. 96.3%).</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.00856">SEEK: Segmented Embedding of Knowledge Graphs</a></p>



<p>In recent years, knowledge graph embedding has become a popular research topic in artificial intelligence and plays an increasingly vital role in various downstream applications, such as recommendation and question answering. However, existing methods for knowledge graph embedding cannot make a proper trade-off between model complexity and model expressiveness, which leaves them far from satisfactory. To mitigate this problem, we propose a lightweight modeling framework that can achieve highly competitive relational expressiveness without increasing model complexity. Our framework focuses on the design of scoring functions and highlights two critical characteristics: 1) facilitating sufficient feature interactions; 2) preserving both the symmetry and antisymmetry properties of relations. Notably, owing to the general and elegant design of its scoring functions, our framework can incorporate many well-known existing methods as special cases. Moreover, extensive experiments on public benchmarks demonstrate the efficiency and effectiveness of our framework.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.00870">Predicting Performance for Natural Language Processing Tasks</a></p>



<p>Given the complexity of combinations of tasks, languages, and domains in natural language processing (NLP) research, it is computationally prohibitive to exhaustively test newly proposed models on each possible experimental setting. In this work, we attempt to explore the possibility of gaining plausible judgments of how well an NLP model can perform under an experimental setting, without actually training or testing the model. To do so, we build regression models to predict the evaluation score of an NLP experiment given the experimental settings as input. Experimenting on 9 different NLP tasks, we find that our predictors can produce meaningful predictions over unseen languages and different modeling architectures, outperforming reasonable baselines as well as human experts. Going further, we outline how our predictor can be used to find a small subset of representative experiments that should be run in order to obtain plausible predictions for all other experimental settings.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1911.00536">DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation</a></p>



<p>We present a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained transformer). Trained on 147M conversation-like exchanges extracted from Reddit comment chains spanning 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain performance close to human level in both automatic and human evaluation in single-turn dialogue settings. We show that conversational systems that leverage DialoGPT generate more relevant, contentful, and context-consistent responses than strong baseline systems. The pre-trained model and training pipeline are publicly released to facilitate research into neural response generation and the development of more intelligent open-domain dialogue systems.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.00316">Self-supervised Knowledge Triplet Learning for Zero-shot Question Answering</a></p>



<p>The aim of all Question Answering (QA) systems is to generalize to unseen questions. Most current methods rely on learning every possible scenario, which depends on expensive data annotation. Moreover, such annotations can introduce unintended bias, which makes systems focus more on the bias than on the actual task. In this work, we propose Knowledge Triplet Learning, a self-supervised task over knowledge graphs. We describe how to use such a model to perform zero-shot QA, and our experiments show considerable improvements over large pre-trained generative models.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2005.00171">Cross-lingual Entity Alignment for Knowledge Graphs with Incidental Supervision from Free Text</a></p>



<p>Much research effort has been put into multilingual knowledge graph (KG) embedding methods to address the entity alignment task, which seeks to match entities in different language-specific KGs that refer to the same real-world object. Such methods are often hindered by the insufficiency of seed alignments provided between KGs. Therefore, we propose a new model, JEANS, which jointly represents multilingual KGs and text corpora in a shared embedding scheme and seeks to improve entity alignment with incidental supervision signals from text. JEANS first deploys an entity grounding process to combine each KG with its monolingual text corpus. Then, two learning processes are conducted: (i) an embedding learning process to encode the KG and text of each language in one embedding space, and (ii) a self-learning-based alignment learning process to iteratively induce the correspondence of entities and of lexemes between embeddings. Experiments on benchmark datasets show that JEANS leads to promising improvements in entity alignment with incidental supervision and significantly outperforms state-of-the-art methods that rely solely on the internal information of KGs.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1911.02896">Contextualized Sparse Representations for Real-Time Open-Domain Question Answering</a></p>



<p>Open-domain question answering can be formulated as a phrase retrieval problem, which promises huge scalability and speed benefits but often suffers from low accuracy due to the limitations of existing phrase representation models. In this paper, we aim to improve the quality of each phrase embedding by augmenting it with a contextualized sparse representation (Sparc). Unlike previous sparse vectors that are term-frequency-based (e.g., tf-idf) or directly learned (only a few thousand dimensions), we leverage rectified self-attention to indirectly learn sparse vectors in an n-gram vocabulary space. By augmenting the previous phrase retrieval model (Seo et al., 2019) with Sparc, we show a 4%+ improvement on CuratedTREC and SQuAD-Open. Our CuratedTREC score is even better than that of the best known retrieve &amp; read model, with at least 45x faster inference speed.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2004.02105">Unsupervised Domain Clusters in Pretrained Language Models</a></p>



<p>The notion of &#8220;in-domain data&#8221; in NLP is often over-simplistic and vague, as textual data varies in many nuanced linguistic aspects such as topic, style, or level of formality. In addition, domain labels are often unavailable, making it challenging to build domain-specific systems. We show that massive pre-trained language models implicitly learn sentence representations that cluster by domain without supervision &#8212; suggesting a simple data-driven definition of domains in textual data. We harness this property and propose domain data selection methods based on such models, which require only a small set of in-domain monolingual data. We evaluate our data selection methods for neural machine translation across five diverse domains, where they outperform an established approach as measured both by BLEU and by precision and recall of sentence selection with respect to an oracle.</p>






<h2>Apr 2020</h2>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2004.05150">Longformer: The Long-Document Transformer</a></p>



<p>Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer&#8217;s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2004.14723">Named Entity Recognition without Labelled Data: A Weak Supervision Approach</a></p>



<p>Named Entity Recognition (NER) performance often degrades rapidly when applied to target domains that differ from the texts observed during training. When in-domain labelled data is available, transfer learning techniques can be used to adapt existing NER models to the target domain. But what should one do when there is no hand-labelled data for the target domain? This paper presents a simple but powerful approach to learn NER models in the absence of labelled data through weak supervision. The approach relies on a broad spectrum of labelling functions to automatically annotate texts from the target domain. These annotations are then merged together using a hidden Markov model which captures the varying accuracies and confusions of the labelling functions. A sequence labelling model can finally be trained on the basis of this unified annotation. We evaluate the approach on two English datasets (CoNLL 2003 and news articles from Reuters and Bloomberg) and demonstrate an improvement of about 7 percentage points in entity-level F1 scores compared to an out-of-domain neural NER model.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2004.06871">ToD-BERT: Pre-trained Natural Language Understanding for Task-Oriented Dialogues</a></p>



<p>The use of pre-trained language models has emerged as a promising direction for improving dialogue systems. However, the underlying differences in linguistic patterns between conversational data and general text make existing pre-trained language models less effective than they have otherwise been shown to be. Recently, some pre-training approaches based on open-domain dialogues have been proposed, leveraging large-scale social media data such as Twitter or Reddit. Pre-training for task-oriented dialogues, on the other hand, is rarely discussed because of the long-standing and crucial data scarcity problem. In this work, we combine nine English-based, human-human, multi-turn, and publicly available task-oriented dialogue datasets to conduct language model pre-training. The experimental results show that our pre-trained task-oriented dialogue BERT (ToD-BERT) surpasses BERT and other strong baselines in four downstream task-oriented dialogue applications, including intent detection, dialogue state tracking, dialogue act prediction, and response selection. Moreover, in simulated limited-data experiments, we show that ToD-BERT has a stronger few-shot capacity that can mitigate the data scarcity problem in task-oriented dialogues.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2004.07672">Generate, Delete and Rewrite: A Three-Stage Framework for Improving Persona Consistency of Dialogue Generation</a></p>



<p>Maintaining a consistent personality in conversations is quite natural for human beings, but is still a non-trivial task for machines. The persona-based dialogue generation task is thus introduced to tackle the personality-inconsistent problem by incorporating explicit persona text into dialogue generation models. Despite the success of existing persona-based models on generating human-like responses, their one-stage decoding framework can hardly avoid the generation of inconsistent persona words. In this work, we introduce a three-stage framework that employs a generate-delete-rewrite mechanism to delete inconsistent words from a generated response prototype and further rewrite it to a personality-consistent one. We carry out evaluations by both human and automatic metrics. Experiments on the Persona-Chat dataset show that our approach achieves good performance.</p>



<h2>Feb 2020</h2>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2002.07397">Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting</a></p>



<p>Open-domain retrieval-based dialogue systems require a considerable amount of training data to learn their parameters. However, in practice, the negative samples of training data are usually selected from an unannotated conversation data set at random. The generated training data is likely to contain noise and affect the performance of the response selection models. To address this difficulty, we consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals and reduce the influence of noisy data. More specifically, we consider a main-complementary task pair. The main task (i.e., our focus) selects the correct response given the last utterance and context, and the complementary task selects the last utterance given the response and context. The key point is that the output of the complementary task is used to set instance weights for the main task. We conduct extensive experiments on two public datasets and obtain significant improvements on both. We also investigate variants of our approach in multiple aspects, and the results have verified the effectiveness of our approach.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2002.07510">Sequential Latent Knowledge Selection for Knowledge-Grounded Dialogue</a></p>



<p>Knowledge-grounded dialogue is a task of generating an informative response based on both discourse context and external knowledge. As we focus on better modeling the knowledge selection in the multi-turn knowledge-grounded dialogue, we propose a sequential latent variable model as the first approach to this matter. The model named sequential knowledge transformer (SKT) can keep track of the prior and posterior distribution over knowledge; as a result, it can not only reduce the ambiguity caused by the diversity in knowledge selection of conversation but also better leverage the response information for proper choice of knowledge. Our experimental results show that the proposed model improves the knowledge selection accuracy and subsequently the performance of utterance generation. We achieve the new state-of-the-art performance on Wizard of Wikipedia (Dinan et al., 2019), one of the largest and most challenging benchmarks. We further validate the effectiveness of our model over existing conversation methods on another knowledge-based dialogue dataset, Holl-E (Moghe et al., 2018).</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/2002.04793">ConvLab-2: An Open-Source Toolkit for Building, Evaluating, and Diagnosing Dialogue Systems</a></p>



<p>We present ConvLab-2, an open-source toolkit that enables researchers to build task-oriented dialogue systems with state-of-the-art models, perform an end-to-end evaluation, and diagnose the weakness of systems. As the successor of ConvLab (Lee et al., 2019b), ConvLab-2 inherits ConvLab&#8217;s framework but integrates more powerful dialogue models and supports more datasets. Besides, we have developed an analysis tool and an interactive tool to assist researchers in diagnosing dialogue systems. The analysis tool presents rich statistics and summarizes common mistakes from simulated dialogues, which facilitates error analysis and system improvement. The interactive tool provides a user interface that allows developers to diagnose an assembled dialogue system by interacting with the system and modifying the output of each system component. <a href="https://github.com/thu-coai/ConvLab-2">Code</a>.</p>






<h2>Dec 2019</h2>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1912.04971">Neural Module Networks for Reasoning over Text</a></p>



<p>Answering compositional questions that require multiple steps of reasoning against text is challenging, especially when they involve discrete, symbolic operations. Neural module networks (NMNs) learn to parse such questions as executable programs composed of learnable modules, performing well on synthetic visual QA domains. However, we find that it is challenging to learn these models for non-synthetic questions on open-domain text, where a model needs to deal with the diversity of natural language and perform a broader range of reasoning. We extend NMNs by: (a) introducing modules that reason over a paragraph of text, performing symbolic reasoning (such as arithmetic, sorting, counting) over numbers and dates in a probabilistic and differentiable manner; and (b) proposing an unsupervised auxiliary loss to help extract arguments associated with the events in text. Additionally, we show that a limited amount of heuristically-obtained question program and intermediate module output supervision provides sufficient inductive bias for accurate learning. Our proposed model significantly outperforms state-of-the-art models on a subset of the DROP dataset that poses a variety of reasoning challenges that are covered by our modules.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1909.00161">Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach</a></p>



<p>Zero-shot text classification (0Shot-TC) is a challenging NLU problem to which little attention has been paid by the research community. 0Shot-TC aims to associate an appropriate label with a piece of text, irrespective of the text domain and the aspect (e.g., topic, emotion, event, etc.) described by the label. Only a few articles study 0Shot-TC, all focusing solely on topical categorization, which, we argue, is just the tip of the iceberg in 0Shot-TC. In addition, the inconsistent experimental setups in the literature allow no uniform comparison, which obscures the actual progress.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1912.07840">Cross-Lingual Ability of Multilingual BERT: An Empirical Study</a></p>



<p>Recent work has exhibited the surprising cross-lingual abilities of multilingual BERT (M-BERT) &#8212; surprising since it is trained without any cross-lingual objective and with no aligned data. In this work, we provide a comprehensive study of the contribution of different components in M-BERT to its cross-lingual ability. We study the impact of linguistic properties of the languages, the architecture of the model, and the learning objectives. The experimental study is done in the context of three typologically different languages &#8212; Spanish, Hindi, and Russian &#8212; and using two conceptually different NLP tasks, textual entailment and named entity recognition. Among our key conclusions is the fact that the lexical overlap between languages plays a negligible role in the cross-lingual success, while the depth of the network is an integral part of it.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1912.07875">Libri-Light: A Benchmark for ASR with Limited or No Supervision</a></p>



<p>We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1912.05877">Extending Machine Language Models toward Human-Level Language Understanding</a></p>



<p>Language is central to human intelligence. We review recent breakthroughs in machine language processing and consider what remains to be achieved. Recent approaches rely on domain general principles of learning and representation captured in artificial neural networks. Most current models, however, focus too closely on language itself. In humans, language is part of a larger system for acquiring, representing, and communicating about objects and situations in the physical and social world, and future machine language models should emulate such a system. We describe existing machine models linking language to concrete situations, and point toward extensions to address more abstract cases. Human language processing exploits complementary learning systems, including a deep neural network-like learning system that learns gradually as machine systems do, as well as a fast-learning system that supports learning new information quickly. Adding such a system to machine language models will be an important further step toward truly human-like language understanding.</p>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1912.04664">How to Evaluate the Next System: Automatic Dialogue Evaluation from<br> the Perspective of Continual Learning</a></p>



<p>Automatic dialogue evaluation plays a crucial role in open-domain dialogue research. Previous works train neural networks with limited annotation for conducting automatic dialogue evaluation, which naturally affects evaluation fairness, as dialogue systems close to the scope of the training corpus would be preferred over others. In this paper, we study alleviating this problem from the perspective of continual learning: given an existing neural dialogue evaluator and the next system to be evaluated, we fine-tune the learned neural evaluator by selectively forgetting/updating its parameters, to jointly fit the dialogue systems that have been and will be evaluated. Our motivation is to seek a lifelong and low-cost automatic evaluation for dialogue systems, rather than reconstructing the evaluator over and over again. Experimental results show that our continual evaluator achieves performance comparable to reconstructing new evaluators, while requiring significantly fewer resources.</p>



<h2>Nov 2019</h2>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1911.10470">Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering</a></p>



<p>Answering questions that require multi-hop reasoning at web-scale necessitates retrieving multiple evidence documents, one of which often has little lexical or semantic relationship to the question. This paper introduces a new graph-based recurrent retrieval approach that learns to retrieve reasoning paths over the Wikipedia graph to answer multi-hop open-domain questions. Our retriever model trains a recurrent neural network that learns to sequentially retrieve evidence paragraphs in the reasoning path by conditioning on the previously retrieved documents. Our reader model ranks the reasoning paths and extracts the answer span included in the best reasoning path. Experimental results show state-of-the-art results in three open-domain QA datasets, showcasing the effectiveness and robustness of our method. Notably, our method achieves significant improvement in HotpotQA, outperforming the previous best model by more than 14 points.</p>



<h2>Sept 2019</h2>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1912.13415">End-to-end Named Entity Recognition and Relation Extraction using Pre-trained Language Models</a></p>



<p>Named entity recognition (NER) and relation extraction (RE) are two important tasks in information extraction and retrieval (IE &amp; IR). Recent work has demonstrated that it is beneficial to learn these tasks jointly, which avoids the propagation of error inherent in pipeline-based systems and improves performance. However, state-of-the-art joint models typically rely on external natural language processing (NLP) tools, such as dependency parsers, limiting their usefulness to domains (e.g. news) where those tools perform well. The few neural, end-to-end models that have been proposed are trained almost completely from scratch. In this paper, we propose a neural, end-to-end model for jointly extracting entities and their relations which does not rely on external NLP tools and which integrates a large, pre-trained language model. Because the bulk of our model&#8217;s parameters are pre-trained and we eschew recurrence for self-attention, our model is fast to train. On 5 datasets across 3 domains, our model matches or exceeds state-of-the-art performance, sometimes by a large margin.</p>



<h2>July 2019</h2>



<p><a href="https://arxiv.org/abs/1907.03020">Towards Universal&nbsp;Dialogue&nbsp;Act&nbsp;Tagging&nbsp;for Task-Oriented&nbsp;Dialogues</a></p>



<p>Machine learning approaches for building task-oriented dialogue systems require large conversational datasets with labels to train on. We are interested in building task-oriented dialogue systems from human-human conversations, which may be available in ample amounts in existing customer care center logs or can be collected from crowd workers. Annotating these datasets can be prohibitively expensive. Recently multiple annotated task-oriented human-machine dialogue datasets have been released, however their annotation schema varies across different collections, even for well-defined categories such as dialogue acts (DAs). We propose a Universal DA schema for task-oriented dialogues and align existing annotated datasets with our schema. Our aim is to train a Universal DA tagger (U-DAT) for task-oriented dialogues and use it for tagging human-human conversations. We investigate multiple datasets, propose manual and automated approaches for aligning the different schema, and present results on a target corpus of human-human dialogues. In unsupervised learning experiments we achieve an F1 score of 54.1% on system turns in human-human dialogues. In a semi-supervised setup, the F1 score increases to 57.7% which would otherwise require at least 1.7K manually annotated turns. For new domains, we show further improvements when unlabeled or labeled target domain data is available.</p>



<h2>May 2019</h2>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1905.08743">Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems</a></p>



<p>Over-dependence on domain ontology and lack of knowledge sharing across domains are two practical and yet less studied problems of dialogue state tracking. Existing approaches generally fall short in tracking unknown slot values during inference and often have difficulties in adapting to new domains. In this paper, we propose a Transferable Dialogue State Generator (TRADE) that generates dialogue states from utterances using a copy mechanism, facilitating knowledge transfer when predicting (domain, slot, value) triplets not encountered during training. Our model is composed of an utterance encoder, a slot gate, and a state generator, which are shared across domains. Empirical results demonstrate that TRADE achieves state-of-the-art joint goal accuracy of 48.62% for the five domains of MultiWOZ, a human-human dialogue dataset. In addition, we show its transferring ability by simulating zero-shot and few-shot dialogue state tracking for unseen domains. TRADE achieves 60.58% joint goal accuracy in one of the zero-shot domains, and is able to adapt to few-shot cases without forgetting already trained domains. <a href="https://github.com/jasonwu0731/trade-dst">Code</a>.</p>



<h2>April 2019</h2>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1904.10635">Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings</a></p>



<p>Despite advances in open-domain dialogue systems, automatic evaluation of such systems is still a challenging problem. Traditional reference-based metrics such as BLEU are ineffective because there could be many valid responses for a given context that share no common words with reference responses. A recent work proposed Referenced metric and Unreferenced metric Blended Evaluation Routine (RUBER) to combine a learning-based metric, which predicts relatedness between a generated response and a given query, with reference-based metric; it showed high correlation with human judgments. In this paper, we explore using contextualized word embeddings to compute more accurate relatedness scores, thus better evaluation metrics. Experiments show that our evaluation metrics outperform RUBER, which is trained on static embeddings.</p>



<h2>March 2019</h2>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1903.11112">Privacy-preserving Active Learning on Sensitive Data for User Intent Classification</a></p>



<p>Active learning holds the promise of significantly reducing data annotation costs while maintaining reasonable model performance. However, it requires sending data to annotators for labeling. This presents a possible privacy leak when the training set includes sensitive user data. In this paper, we describe an approach for carrying out privacy-preserving active learning with quantifiable guarantees. We evaluate our approach by showing the tradeoff between privacy, utility and annotation budget on a binary classification task in an active learning setting.</p>



<h2>Earlier research papers</h2>



<p class="has-text-align-center"><a href="https://arxiv.org/abs/1701.03079">RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems</a></p>



<p>Open-domain human-computer conversation has been attracting increasing attention over the past few years. However, there does not exist a standard automatic evaluation metric for open-domain dialog systems; researchers usually resort to human annotation for model evaluation, which is time- and labor-intensive. In this paper, we propose RUBER, a Referenced metric and Unreferenced metric Blended Evaluation Routine, which evaluates a reply by taking into consideration both a groundtruth reply and a query (previous user-issued utterance). Our metric is learnable, but its training does not require labels of human satisfaction. Hence, RUBER is flexible and extensible to different datasets and languages. Experiments on both retrieval and generative dialog systems show that RUBER has a high correlation with human annotation.</p>
]]></content:encoded>
							<wfw:commentRss>http://metah.ch/blog/2019/12/conversational-ai-research-papers/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
						<post-id xmlns="com-wordpress:feed-additions:1">1123</post-id>	</item>
		<item>
		<title>Display GitHub files on WordPress</title>
		<link>http://metah.ch/blog/2018/01/display-github-files-on-wordpress/</link>
				<comments>http://metah.ch/blog/2018/01/display-github-files-on-wordpress/#respond</comments>
				<pubDate>Sat, 06 Jan 2018 18:01:53 +0000</pubDate>
		<dc:creator><![CDATA[Ahmet]]></dc:creator>
				<category><![CDATA[Usability]]></category>
		<category><![CDATA[Web Design]]></category>
		<category><![CDATA[Wordpress]]></category>

		<guid isPermaLink="false">http://metah.ch/blog/?p=1109</guid>
				<description><![CDATA[I&#8217;ve been struggling a bit to find a good way to display files hosted on GitHub directly on my blog. I didn&#8217;t like that most plugins use Gist, which doesn&#8217;t allow for nice collaboration. I have received some good recommendations about improving code that I shared some time ago. I&#8217;d prefer a system where… <span class="read-more"><a href="http://metah.ch/blog/2018/01/display-github-files-on-wordpress/">Read More &#187;</a></span>]]></description>
								<content:encoded><![CDATA[<p>I&#8217;ve been struggling a bit to find a good way to display files hosted on GitHub directly on my blog. I didn&#8217;t like that most plugins use Gist, which doesn&#8217;t allow for nice collaboration. I have received some good recommendations about improving code that I shared some time ago. I&#8217;d prefer a system where people can push a PR directly to my Git repo and, if I merge it, the code is directly updated on the blog.</p>
<p>Finally, I found a service that allows me to do just that: <a href="http://gist-it.appspot.com/">http://gist-it.appspot.com/</a>.</p>
<p>To use it, it&#8217;s as simple as adding a link to your GitHub file, as below (where $file is the link to the GitHub file).</p>
<p style="padding-left: 30px;">&lt;script src=&#8221;http://gist-it.appspot.com/$file&#8221;&gt;&lt;/script&gt;</p>
]]></content:encoded>
							<wfw:commentRss>http://metah.ch/blog/2018/01/display-github-files-on-wordpress/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
						<post-id xmlns="com-wordpress:feed-additions:1">1109</post-id>	</item>
		<item>
		<title>Hello 2018! A look back at my learnings from 2017 to start the year.</title>
		<link>http://metah.ch/blog/2018/01/hello-2018-a-look-back-at-my-learnings-from-2017-to-start-the-year/</link>
				<comments>http://metah.ch/blog/2018/01/hello-2018-a-look-back-at-my-learnings-from-2017-to-start-the-year/#respond</comments>
				<pubDate>Wed, 03 Jan 2018 15:54:29 +0000</pubDate>
		<dc:creator><![CDATA[Ahmet]]></dc:creator>
				<category><![CDATA[Point of View]]></category>
		<category><![CDATA[Personal]]></category>

		<guid isPermaLink="false">http://metah.ch/blog/?p=1100</guid>
				<description><![CDATA[Happy new year (if you are on the Gregorian calendar)!!! I’ve been wanting to write a post for a while, but so much has been going on in the last year that I couldn’t prioritize it. The value of having a blog for me was always to share knowledge with fellows on the web. I’ve… <span class="read-more"><a href="http://metah.ch/blog/2018/01/hello-2018-a-look-back-at-my-learnings-from-2017-to-start-the-year/">Read More &#187;</a></span>]]></description>
								<content:encoded><![CDATA[<p>Happy new year (if you are on the Gregorian calendar)!!! I’ve been wanting to write a post for a while, but so much has been going on in the last year that I couldn’t prioritize it. The value of having a blog, for me, was always to share knowledge with fellows on the web. I’ve been learning so much from other blogs and web resources; this is my way of giving back to the community. Blogging is also an easy way for me to “commit” certain things to memory, like summaries of books or articles.</p>
<h1>A bit of context first!</h1>
<p>In 2016, my family and I moved from Seattle to Boston. I stayed at Microsoft but moved from the Bing team to the Azure Machine Learning team, which had an opening in Cambridge, MA. We had stayed in Seattle for about 5 years; we had a really great time, made great friends, and got to learn what it means to live on the Microsoft mothership.</p>
<p>A few months after I joined the team in Cambridge, the group decided to update our strategy with regard to Azure Machine Learning, and we started working on a set of new capabilities to offer data scientists to make their work more efficient. I also started a certificate in Data Science, as I felt that my knowledge of Machine Learning and AI wasn’t academic enough to deeply understand the challenges of data scientists. Combining the new product and the certificate (and the normal life of a dad with 2 kids) took a toll on some side activities, like this blog.</p>
<h1>Learnings from 2017</h1>
<p>There were many learnings this year, both from a human standpoint and from a professional standpoint.</p>
<h2>Have a north star, stick to it, but be open to opportunities and feedback.</h2>
<p>When building a new product (a v1), it’s really important to have a vision of what value your product is going to bring to your users. We started by interviewing a large number of data scientists within and outside Microsoft to understand what their daily job was and what their challenges were. A few pretty clear patterns emerged quickly, and we decided to focus on them. However, we couldn’t do everything in a reasonable time, so we had to take some shortcuts and make some cuts too. We had a few potential users that we could talk to and get their feedback on our plan, but without a pool of active users, it’s challenging to have enough (statistically significant) data before making a decision.</p>
<p><strong>Learnings</strong>:</p>
<ol>
<li>Have a network of potential users; know how to reach out to them and talk with them.</li>
<li>Find patterns in all the discussions you have with your users; be really articulate in the process of asking them questions.</li>
<li>Build a product vision, and try to get some customers to love it and agree to give feedback along the journey.</li>
<li>Creating a product is both about coming up with the right vision and about entering the market early enough; you need to balance these when making decisions.</li>
<li>When building the product, do not forget the destination envisioned, but do not be blinded by it either. The goal is to build enough of a product for your users to give you feedback (an MVP); if there is a clear trend in the feedback, listen to it.</li>
</ol>
<h2>Doing classes in addition to work requires effort and time</h2>
<p>That might sound obvious, but the certificate I’m pursuing requires effort and time: effort to understand, and time for the required homework and reading. I took two classes in 2017 (one per trimester); each class required about 10 hours per week of extra work. Given that we were really busy at work, I already had pretty long days. I’m really lucky that my wife was very supportive of this initiative; she made sure I had enough time for my classes while preparing the kids for seeing dad work late, sometimes on their time with me.</p>
<p><strong>Learnings:</strong></p>
<ol>
<li>Have your family’s support; you are going to miss a lot of family time while working, so make sure to send the right message.</li>
<li>Be organized. I would generally watch the online class as early as possible and then take weekday evenings to complete the homework. Weekend mornings were dedicated to that too.</li>
<li>Get enough rest. Altogether, my weeks were 60+ hours of work, which is dangerously close (for me at least) to burning out. Making sure you have enough time to recharge your batteries while spending quality time with your loved ones is a great way to keep your motivation high and your spirits up.</li>
</ol>
<h2>Give yourself a stretch goal</h2>
<p>When I joined my new team, I was asked to give a weekly 5-minute presentation about how the product was going (metrics analysis), which helped me build my “public speaking” skill. Then I had opportunities (which I actively sought out) to present at conferences: I gave 5 talks this year, in front of hundreds of people. The first one was terrible but very valuable; then it just got better. It was a stretch goal for me, as I used to get a bit anxious before speaking to large groups of people.</p>
<p><strong>Learnings:</strong></p>
<ol>
<li>Public speaking is challenging; the best way to master it is to prepare a lot for each talk and to give as many talks as possible.</li>
<li>Know your subject inside out; this will help reduce the stress.</li>
<li>Have the right tools for the presentation (adapters, charger, slides, zoom, …), whatever helps make your presentation better to watch.</li>
<li>People want you to succeed and are there to learn something.</li>
</ol>
<p>Thanks for reading and, again, happy new year!</p>
]]></content:encoded>
							<wfw:commentRss>http://metah.ch/blog/2018/01/hello-2018-a-look-back-at-my-learnings-from-2017-to-start-the-year/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
						<post-id xmlns="com-wordpress:feed-additions:1">1100</post-id>	</item>
		<item>
		<title>Azure Machine Learning Deployment at Scale Using ARM and AMLPS.</title>
		<link>http://metah.ch/blog/2016/09/azure-machine-learning-deployment-at-scale-using-arm-and-amlps/</link>
				<comments>http://metah.ch/blog/2016/09/azure-machine-learning-deployment-at-scale-using-arm-and-amlps/#respond</comments>
				<pubDate>Fri, 09 Sep 2016 13:51:43 +0000</pubDate>
		<dc:creator><![CDATA[Ahmet]]></dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Azure Machine Learning PowerShell]]></category>
		<category><![CDATA[Azure Resource Manager]]></category>
		<category><![CDATA[PowerShell]]></category>

		<guid isPermaLink="false">http://metah.ch/blog/?p=1085</guid>
				<description><![CDATA[Introduction In this post, I will demonstrate a simple but useful scenario when managing Azure Machine Learning Workspaces and Experiments: copying all the experiments under one workspace, deploying a new workspace using ARM (Azure Resource Manager) in another region, and then copying the experiments into the newly deployed workspace. Preparation You will… <span class="read-more"><a href="http://metah.ch/blog/2016/09/azure-machine-learning-deployment-at-scale-using-arm-and-amlps/">Read More &#187;</a></span>]]></description>
								<content:encoded><![CDATA[<h1>Introduction</h1>
<p>In this post, I will demonstrate a simple but useful scenario when managing Azure Machine Learning Workspaces and Experiments: copying all the experiments under one workspace, deploying a new workspace using ARM (Azure Resource Manager) in another region, and then copying the experiments into the newly deployed workspace. </p>
<h1>Preparation</h1>
<p>You will need to install a couple of things: </p>
<ul>
<li>Azure Resource Manager PowerShell modules</li>
<li>Azure Service Management PowerShell modules</li>
<li>Azure Machine Learning PowerShell modules (via GitHub)</li>
</ul>
<p>[powershell]<br />
# Install the Azure Resource Manager modules from the PowerShell Gallery<br />
Install-Module AzureRM -Scope CurrentUser</p>
<p># Install the Azure Service Management modules from the PowerShell Gallery<br />
Install-Module Azure -Scope CurrentUser<br />
[/powershell]<br />
Installing AMLPS is not yet as easy, but it is not too hard :).</p>
<ul>
<li>Download the latest zip file from <a href="https://github.com/hning86/azuremlps/releases">https://github.com/hning86/azuremlps/releases</a> (as of today, the beta is 0.2.8) and unzip it to a folder (let’s say c:\amlps) – this is a manual step. </li>
<li>Navigate to c:\amlps </li>
<li>Unblock the file </li>
<li>Load the module </li>
</ul>
<p>[powershell]<br />
#Unblock the downloaded dll file so Windows can trust it.<br />
Unblock-File .\AzureMLPS.dll</p>
<p>#import the PowerShell module into current session<br />
Import-Module .\AzureMLPS.dll<br />
[/powershell]</p>
<p>Of course, you will also need to have an <a href="https://studio.azureml.net/">Azure Machine Learning account</a>.</p>
<h1>Configure</h1>
<p>Now that all the modules are installed and imported, let’s configure our session.<br />
First, let’s make sure we can authenticate to Azure RM and SM and list our workspaces.<br />
[powershell]<br />
# Authenticate (enter your credentials in the pop-up window)<br />
Login-AzureRmAccount</p>
<p># List all workspaces<br />
Get-AzureRmResource |? { $_.ResourceType -Like &quot;*MachineLearning/workspaces*&quot;}<br />
[/powershell]<br />
At this point, you should see all your workspaces listed.<br />
We are now ready to configure AMLPS. For this, you will need to retrieve the workspace ID and the authorization token (detailed step by step in the <a href="https://github.com/hning86/azuremlps">AMLPS configuration</a> guide). For simplicity, we will update config.json (in c:\amlps). From the above list of workspaces, select the one you want to duplicate. In my example, the workspace name is “workspaceus”.<br />
[powershell]<br />
# Select workspace with the name “workspaceus”<br />
$wsp = Get-AzureRmResource |? { $_.Name -Like &quot;workspaceus&quot;}</p>
<p># Get the workspaceId<br />
$wid = (Get-AzureRmResource -Name $wsp.Name -ResourceGroupName $wsp.ResourceGroupName -ResourceType $wsp.ResourceType -ApiVersion 2016-04-01).Properties.workspaceId</p>
<p># Get the primary token<br />
$wpt = (Invoke-AzureRmResourceAction -ResourceId $wsp.ResourceId -Action listworkspacekeys -Force).primaryToken</p>
<p># Get the location of the workspace<br />
$wil = (Get-AzureRmResource -Name $wsp.Name -ResourceGroupName $wsp.ResourceGroupName -ResourceType $wsp.ResourceType -ApiVersion 2016-04-01).Location</p>
<p># Create the JSON config file<br />
(New-Object psobject | Add-Member -PassThru NoteProperty Location $wil | Add-Member -PassThru NoteProperty WorkspaceId $wid | Add-Member -PassThru NoteProperty AuthorizationToken $wpt) | ConvertTo-Json &gt; config.json<br />
[/powershell]</p>
<h1>Deploy</h1>
<p>Now that we have installed and configured all the tools, we can start our example. We will take a workspace and its experiments located in “South Central US” and copy them to “West Europe”. Below is an illustration representing the different steps in our journey.<br />
<a href="http://metah.ch/blog/wp-content/uploads/2016/09/MachineLearningWorkspace.png"><img src="http://metah.ch/blog/wp-content/uploads/2016/09/MachineLearningWorkspace.png" alt="machinelearningworkspace" class="aligncenter size-full wp-image-1086" srcset="http://metah.ch/blog/wp-content/uploads/2016/09/MachineLearningWorkspace.png 1095w, http://metah.ch/blog/wp-content/uploads/2016/09/MachineLearningWorkspace-300x245.png 300w, http://metah.ch/blog/wp-content/uploads/2016/09/MachineLearningWorkspace-768x628.png 768w, http://metah.ch/blog/wp-content/uploads/2016/09/MachineLearningWorkspace-1024x838.png 1024w, http://metah.ch/blog/wp-content/uploads/2016/09/MachineLearningWorkspace-660x540.png 660w" sizes="(max-width: 1095px) 100vw, 1095px" /></a></p>
<h2>Get workspace information</h2>
<p>We will export all experiment graphs as JSON files so we can import them back into the new workspace.<br />
[powershell]<br />
# Create folder for export<br />
New-Item -Name &quot;Export&quot; -ItemType &quot;directory&quot; -Force</p>
<p># Export all experiments in the workspace<br />
Get-AmlExperiment |% {$i=0}{Export-AmlExperimentGraph -ExperimentId $_.ExperimentId -OutputFile &quot;c:\amlps\export\exp$i.json&quot;; $i++}<br />
[/powershell]</p>
<h2>Deploy new workspace in another location</h2>
<p>To deploy the new workspace, you can refer to this more detailed article to get a sample <a href="https://azure.microsoft.com/en-us/documentation/articles/machine-learning-deploy-with-resource-manager-template/">ARM template to deploy a new machine learning workspace</a>.<br />
[powershell]<br />
# Create a new resource group in West Europe.<br />
$rg = New-AzureRmResourceGroup -Name &quot;uniquenamerequired723&quot; -Location &quot;West Europe&quot;</p>
<p># Deploy the workspace ARM template into the resource group; TemplateFile is the path to the JSON template.<br />
$rgd = New-AzureRmResourceGroupDeployment -Name &quot;demo&quot; -TemplateFile &quot;mlworkspace.json&quot; -ResourceGroupName $rg.ResourceGroupName<br />
[/powershell]</p>
<h2>Copy the experiment into the new workspace</h2>
<p>First, we need to update the configuration for AMLPS to the new location.<br />
[powershell]<br />
# Select workspace just created<br />
$wsp = Get-AzureRmResource |? { $_.ResourceGroupName -Like $rgd.ResourceGroupName -AND $_.ResourceType -Like &quot;Microsoft.MachineLearning/Workspaces&quot;}</p>
<p># Get the workspaceId<br />
$wid = (Get-AzureRmResource -Name $wsp.Name -ResourceGroupName $wsp.ResourceGroupName -ResourceType $wsp.ResourceType -ApiVersion 2016-04-01).Properties.workspaceId</p>
<p># Get the primary token<br />
$wpt = (Invoke-AzureRmResourceAction -ResourceId $wsp.ResourceId -Action listworkspacekeys -Force).primaryToken</p>
<p># Get the location of the workspace<br />
$wil = (Get-AzureRmResource -Name $wsp.Name -ResourceGroupName $wsp.ResourceGroupName -ResourceType $wsp.ResourceType -ApiVersion 2016-04-01).Location</p>
<p># Create the JSON config file<br />
(New-Object psobject | Add-Member -PassThru NoteProperty Location $wil | Add-Member -PassThru NoteProperty WorkspaceId $wid | Add-Member -PassThru NoteProperty AuthorizationToken $wpt) | ConvertTo-Json &gt; config.json<br />
[/powershell]<br />
Now we can import the experiments we exported from the previous workspace.<br />
[powershell]<br />
Get-ChildItem export\* -Include *.json |% {Import-AmlExperimentGraph -InputFile $_ }<br />
[/powershell]</p>
<h2>Test that all is working correctly</h2>
<p>The simplest thing to do at this point is to list all the experiments under the workspace.<br />
[powershell]<br />
Get-AmlExperiment<br />
[/powershell]<br />
The result of this command should be a list of all the experiments under your newly deployed workspace. </p>
<p>If you have no idea where to start with your Machine Learning experiments, you can have a look at a tutorial I wrote a while ago about <a href="http://metah.ch/blog/2014/09/introduction-to-machine-learning-from-data-acquisition-to-a-production-service-2/">getting started with Azure Machine Learning</a>. You should also check the <a href="https://gallery.cortanaintelligence.com/">Cortana Intelligence Gallery</a>, where plenty of experiments are available for free. </p>
]]></content:encoded>
							<wfw:commentRss>http://metah.ch/blog/2016/09/azure-machine-learning-deployment-at-scale-using-arm-and-amlps/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
						<post-id xmlns="com-wordpress:feed-additions:1">1085</post-id>	</item>
		<item>
		<title>The lean startup – book notes</title>
		<link>http://metah.ch/blog/2016/05/the-lean-startup-book-notes/</link>
				<comments>http://metah.ch/blog/2016/05/the-lean-startup-book-notes/#respond</comments>
				<pubDate>Mon, 30 May 2016 15:13:51 +0000</pubDate>
		<dc:creator><![CDATA[Ahmet]]></dc:creator>
				<category><![CDATA[Books]]></category>
		<category><![CDATA[engineering]]></category>
		<category><![CDATA[Experiment]]></category>
		<category><![CDATA[Intelligence]]></category>
		<category><![CDATA[Program management]]></category>
		<category><![CDATA[Book]]></category>
		<category><![CDATA[Lean Development]]></category>
		<category><![CDATA[Lean Startup]]></category>

		<guid isPermaLink="false">http://metah.ch/blog/?p=1080</guid>
				<description><![CDATA[The lean startup by Eric Ries is a very interesting book about creating a successful business. It&#8217;s really a great read even if you are not working in a startup. Actually, most large companies have moved away from development processes that didn&#8217;t adapt well to new technologies. In today&#8217;s world, we ship software daily, sometimes… <span class="read-more"><a href="http://metah.ch/blog/2016/05/the-lean-startup-book-notes/">Read More &#187;</a></span>]]></description>
								<content:encoded><![CDATA[<p>The lean startup by Eric Ries is a very interesting book about creating a successful business.</p>
<p>It&#8217;s really a great read even if you are not working in a startup. Actually, most large companies have moved away from development processes that didn&#8217;t adapt well to new technologies. In today&#8217;s world, we ship software daily, sometimes multiple times per day. We even ship multiple versions of the software in parallel (A/B testing) to find out which version resonates the most with our users. This is, of course, very different from shipping software on a CD every <em>n</em> years.</p>
<p><a href="http://www.amazon.com/gp/product/0307887898/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0307887898&amp;linkCode=as2&amp;tag=metah-20&amp;linkId=7SRJSISBCRMWU6A7"><img src="http://ws-na.amazon-adsystem.com/widgets/q?_encoding=UTF8&amp;ASIN=0307887898&amp;Format=_SL250_&amp;ID=AsinImage&amp;MarketPlace=US&amp;ServiceVersion=20070822&amp;WS=1&amp;tag=metah-20" alt="" border="0" /></a><img style="border: none !important; margin: 0px !important;" src="http://ir-na.amazon-adsystem.com/e/ir?t=metah-20&amp;l=am2&amp;o=1&amp;a=0307887898" alt="" width="1" height="1" border="0" /></p>
<h1>Notes</h1>
<p>Below are some quotes from the book with some comments.</p>
<blockquote><p>The fundamental goal of entrepreneurship is to engage in organization building under conditions of extreme uncertainty, its most vital function is learning. […] Learn what customers really want, discover whether we are on a path that will lead to growing a sustainable business.</p></blockquote>
<p>Behind the word &#8220;learning&#8221; hides a lot of complexity: it&#8217;s about demonstrating empirically that we are going in the right direction. This also means that we should avoid vanity metrics (e.g. instead of tracking the number of visitors on your product page, track the number of visitors actually taking crucial actions within your product).</p>
<blockquote><p>Our main concerns in the early days dealt with the following questions: What should we build and for whom? What market could we enter and dominate? How could we build durable value that would not be subject to erosion by competition.</p></blockquote>
<p>Too often we don&#8217;t ask ourselves these questions; we are so focused on building a great product (from the engineering perspective) that we lose sight of the important points. The what, for whom, and how much are questions you should always have in mind.</p>
<blockquote><p>The effort that is not absolutely necessary for learning what customers want can be eliminated. I call this validated learning because it is always demonstrated by positive improvements in the startup&#8217;s core metrics.</p></blockquote>
<p>The idea is to build the minimum viable product that will confirm whether your product is going in the right direction, as measured by your core metrics. In a large company, there is a tax that comes with being part of it. Some of the minimum things that need to be built are not always relevant to your learning but are present to reinforce the company&#8217;s core values (e.g. security and privacy can, to some extent, be punted to later by a startup, while a group within a large company cannot).</p>
<blockquote><p>The value hypothesis tests whether a product or service really delivers value to customers once they are using it. […] Growth hypothesis tests how new customers will discover a product or service.</p></blockquote>
<p>These should be part of your core metrics: are we delivering value? How does our product grow? They are the most important leap-of-faith questions any startup faces.</p>
<blockquote><p>Answer four questions before investing engineering resources:</p>
<ol>
<li>Do consumers recognize that they have the problem you are trying to solve?</li>
<li>If there was a solution, would they buy it?</li>
<li>Would they buy it from us?</li>
<li>Can we build a solution for that problem?</li>
</ol>
</blockquote>
<p>For a larger company, there is an additional layer of complexity: there is often a sales organization interacting directly with the customers, and product teams typically were not directly engaged with them. In my current group at Microsoft (Azure), the program managers interview customers directly, engaging with them to understand their challenges and to establish great relationships with them.</p>
<blockquote><p>Build a process of identifying risk and assumptions before building anything and then testing those assumptions experimentally</p></blockquote>
<p>Use prototypes and interviews; it&#8217;s cheaper than building a product!</p>
<blockquote><p>Build experiments, identify the elements of the plan that are assumptions rather than facts, and figure out ways to test them. Using these insights, we could build a minimum viable product.</p></blockquote>
<p>This reiterates the previous point: the goal is to reduce the cost of building a product that will be liked and used by customers.</p>
<p><img src="http://metah.ch/blog/wp-content/uploads/2016/05/053016_1424_Theleanstar1.png" alt="" /></p>
<blockquote><p>The MVP is that version of the product that enables a full turn of the Build-Measure-Learn loop with a minimum amount of effort and the least amount of development time. […] MVP is designed not just to answer product design or technical questions. Its goal is to test fundamental business hypotheses. […] A video can be a great form of MVP, a demonstration of how the technology works, targeted at a community of technology early adopters.</p></blockquote>
<p>To perform well within this cycle, your infrastructure needs to support fast iteration. You can&#8217;t really build, measure, and learn if it takes you one month to ship bits to production. Moving away from the traditional infrastructure is probably one of the biggest investments a company needs to make in order to move to a leaner development cycle. Some of the fundamentals required:</p>
<ol>
<li>Support for fast deployment to production</li>
<li>Support for Test in Production and experimentation (AB testing)</li>
<li>
<div>Channels to gather feedback from users and to measure their interest.</div>
</li>
</ol>
<blockquote><p>Traditional approaches such as interaction design or design thinking are enormously helpful.</p></blockquote>
<p><iframe src="https://www.youtube.com/embed/U499U4TcyY8" width="854" height="480" frameborder="0" allowfullscreen="allowfullscreen"></iframe></p>
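<p>As a toy illustration of the experimentation (A/B testing) support mentioned above, bucket assignment can be made deterministic by hashing the user and experiment identifiers; the experiment name and split below are hypothetical, not from the book:</p>

```python
import hashlib

def ab_bucket(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing (experiment + user_id) yields a stable, roughly uniform value
    in [0, 1], so the same user always lands in the same bucket for a
    given experiment, with no assignment table to store.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    value = int(digest[:8], 16) / 0xFFFFFFFF  # first 32 bits mapped to [0, 1]
    return "treatment" if value < treatment_share else "control"

# Stable: the same user gets the same bucket every time for a given experiment.
assert ab_bucket("user-42", "new-onboarding") == ab_bucket("user-42", "new-onboarding")
```

<p>Hashing instead of random assignment means the bucket can be recomputed anywhere without storing per-user state, which keeps the measure step of the loop simple.</p>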
<blockquote><p>Give the concierge treatment to your early adopters, learn more and more about what it takes to make the product great. […] Measured according to traditional criteria, this is a terrible system, entirely nonscalable and a complete waste of time. But as a results of the learnings, the development efforts involve less waste than typical.</p></blockquote>
<p>That&#8217;s the whole premise: invest time in learning from your users and make sure you build what they need. Then invest in building what delivers the greatest value for them first. <a href="http://paulgraham.com/ds.html">Do things that don&#8217;t scale</a>.</p>
<blockquote><p>Most modern business and engineering philosophies focus on producing high-quality experiences for customers as a primary principle. […] These discussions of quality presuppose that the company already knows what attributes of the product the customers will perceive as worthwhile.</p></blockquote>
<p>If your MVP feels rough to your customers, use it as an opportunity to learn about what they care about. Always ask yourself whether the customers care about design in the same way you do.</p>
<blockquote><p>The truth is that most managers in most companies are already overwhelmed with good ideas. Their challenge lies in prioritization and execution, and it is those challenges that give a startup a hope of surviving.</p></blockquote>
<p>A good read on this topic is <a href="https://hbr.org/2015/12/what-is-disruptive-innovation">disruptive innovation</a>. If you are part of a larger company, the prioritization effort should come from the learning you gained from your users. When working with a lot of talented engineers, it is sometimes more exciting to solve big technical challenges than to build the set of features your users need.</p>
<blockquote><p>A startup&#8217;s job is to 1) rigorously measure where it is right now, confronting the hard truths that assessment reveals, and then 2) devise experiments to learn how to move the real numbers closer to the ideal reflected in the business plan.</p></blockquote>
<p>This is true not only for startups but for all companies, no matter where they are in their <a href="https://hbr.org/1965/11/exploit-the-product-life-cycle/ar/1">lifecycle</a>.</p>
<blockquote><p>The rate of growth depends primarily on three things: the profitability of each customer, the cost of acquiring new customers, and the repeat purchase rate of existing customers.</p></blockquote>
<p>It&#8217;s interesting to focus on the rate of growth without obsessing over other cost factors (cost to serve), as those other costs will likely benefit from economies of scale.</p>
<blockquote><p>The three learning milestones are: 1) use a minimum viable product to establish real data on where the company is right now, 2) tune the engine (of growth) from the baseline toward the ideal, 3) pivot or persevere.</p></blockquote>
<p>Step #3 is rarer for larger companies; I can&#8217;t remember many examples where a large company pivoted. It seems the outcome of #3 is more often fail or persevere.</p>
<blockquote><p>Funnel metrics: behaviors that are critical to your engine of growth (e.g. customer registration, download of application, trial, repeated usage, purchase).</p></blockquote>
<p>User funnel analysis is very helpful for discovering how you are doing and for identifying the friction points that deserve your attention. When adding new features to your product, always make sure that you can easily measure all the steps leading to the feature and how the feature is used.</p>
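<p>As a small sketch of the idea (the funnel steps and counts are invented for illustration), step-to-step conversion rates can be computed directly from raw event counts:</p>

```python
# Hypothetical funnel: number of users reaching each step, in order.
funnel = [
    ("visited product page", 10_000),
    ("registered", 2_500),
    ("downloaded application", 1_200),
    ("purchased", 300),
]

def conversion_rates(steps):
    """Return (step name, conversion rate from the previous step) pairs."""
    return [
        (name, count / prev_count)
        for (_, prev_count), (name, count) in zip(steps, steps[1:])
    ]

for name, rate in conversion_rates(funnel):
    print(f"{name}: {rate:.0%}")
# registered: 25%
# downloaded application: 48%
# purchased: 25%
```

<p>A sharp drop between two adjacent steps is exactly the kind of friction point worth investigating first.</p>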
<blockquote><p>Cohort analysis: instead of looking at cumulative totals or gross numbers such as total revenue or total number of customers, one looks at the performance of each group of customers that comes into contact with the product independently (i.e. customer who joined each month).</p></blockquote>
<p>Cohort analysis is also useful for tracking the progress of metrics following changes: once you ship new features, you can measure their effects on the cohorts carrying the change.</p>
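<p>A minimal sketch of the technique (the sign-up months and activity flags below are made up): group users by the month they joined and compute retention per cohort, rather than one gross number for the whole user base:</p>

```python
from collections import defaultdict

# Hypothetical per-user records: (join_month, still_active).
users = [
    ("2016-01", True), ("2016-01", False), ("2016-01", True), ("2016-01", False),
    ("2016-02", True), ("2016-02", True), ("2016-02", False),
]

def cohort_retention(records):
    """Return {join_month: fraction of that cohort still active}."""
    totals = defaultdict(int)
    active = defaultdict(int)
    for month, is_active in records:
        totals[month] += 1
        active[month] += int(is_active)
    return {month: active[month] / totals[month] for month in totals}

retention = cohort_retention(users)  # e.g. the 2016-01 cohort retains 2 of 4 users
```

<p>Comparing the cohorts that joined before and after a change isolates the effect of that change instead of letting it drown in the cumulative totals.</p>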
<blockquote><p>Metrics should honor the three A&#8217;s: Actionable (clear cause and effect), accessible (everyone can understand and access them), auditable (ensure the data is credible).</p></blockquote>
<blockquote><p>As soon as we formulate a hypothesis that we want to test, the product development team should be engineered to design and run this experiment as quickly as possible, using the smallest batch size that will get the job done. Remember that although we write the feedback loop as build-measure-learn because the activities happen in that order, our planning really works in the reverse order: we figure out what we need to learn and then work backwards to see what product will work as an experiment to get that learning.</p></blockquote>
<p>In theory, that resonates really well with me; in practice, however, it sounds harder to pull off. The development team is often assigned to work on multiple things and has different priorities. This assumes that pulling a team of developers away from their &#8216;normal&#8217; duties is acceptable. I&#8217;ve never seen this work successfully in larger and older groups.</p>
<blockquote><p>Technically, more than one engine of growth can operate in a business at a time […] successful startups usually focus on just one engine of growth, specializing in everything that is required to make it work.</p></blockquote>
<p>Engines of growth: 1) the sticky engine (relies on repeated usage), 2) the viral engine (network effects), 3) the paid engine (paying to acquire customers).</p>
<blockquote><p>Andon cord: &#8220;Stop production so that production never has to stop&#8221;. […] You cannot trade quality for time. Defects cause a lot of rework, low morale, and customer complaints, all of which slow progress and eat away at valuable resources.</p></blockquote>
<p>Balancing speed and quality is complex; having a system that forces quality in might have a higher cost at first, but in the long run it will provide multiple benefits.</p>
<blockquote><p>Use the Five Whys: this helps get to the root of every seemingly technical problem.</p></blockquote>
<p>But don&#8217;t use it to blame people; use it to learn and to prevent similar problems from ever happening again. <a href="https://en.wikipedia.org/wiki/5_Whys">5 Whys article on Wikipedia</a>.</p>
<h1>Conclusion</h1>
<p>This book will stay on my shelf; what I like most about it is:</p>
<ul>
<li>Put the learning you will get from any development effort at the forefront; make sure you can measure your successes and failures. Build an MVP.</li>
<li>Stay in sync with your customers; learn what they need and how they work</li>
<li>The end goal is to be successful; this will most probably involve failures (learning opportunities), which will require perseverance or pivots.</li>
<li>Avoid vanity metrics, focus on real metrics</li>
</ul>
]]></content:encoded>
							<wfw:commentRss>http://metah.ch/blog/2016/05/the-lean-startup-book-notes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
						<post-id xmlns="com-wordpress:feed-additions:1">1080</post-id>	</item>
	</channel>
</rss>
