InfoCommerce Group Blog - InfoCommerce Group

Fake Data, Real Consequences

R Perkins — Thu, 18 Jun 2020 19:19:16 +0000

While we are hearing a lot these days about so-called “fake news,” we are also seeing instances of something that is arguably more pernicious: fake data. By “fake data,” I mean datasets that have been sloppily constructed or maintained, or datasets that appear to contain false data, either in whole or in part. What is particularly scary about fake data is that it is often used in hugely important applications that result in bad and even dangerous outcomes.

My first encounter with fake data was in 2006. A reputable data company, First DataBank, had been producing an obscure database of wholesale drug prices. The company would survey drug companies and publish average prices for many widely prescribed drugs. Over time, and seemingly more through sloth than nefarious intent, First DataBank let the number of drug companies submitting data dwindle down to one, so the product was no longer producing industry average pricing. That’s embarrassing enough on its own, but because both Medicaid and Medicare were relying on the data to reimburse prescription claims, it resulted in vast reimbursement overpayments.

In 2009, we had the example of Ingenix, a medical data publisher. It published two datasets that reported what are called “usual, customary and reasonable” physician fees for various procedures. The datasets were used by hundreds of health insurance companies to make out-of-network payments to physicians, and were much hated by physicians because of the low payments that resulted. Nothing about this is untoward, except for the fact that Ingenix was owned by UnitedHealth Group, one of the nation’s largest health insurers, and the products relied heavily on data from UnitedHealth Group. The lower the prices that Ingenix reported, the more money its parent company made.

More recently, you may recall the global scandal over LIBOR, a dataset maintained by the British Bankers’ Association. It reflected the average interest rates banks would charge to lend money to each other and was a tiny dataset with one huge impact: LIBOR is used to determine interest rates for an estimated $350 trillion in loans, mortgages and other financial transactions. Traders at the banks supplying data to the BBA eventually realized that by colluding with each other they could manipulate the LIBOR rates and profit off this advance knowledge simply by reporting false data to BBA.

The most recent example relates to the COVID pandemic. A company called Surgisphere burst on the scene with anonymized personal health data for nearly 100,000 COVID patients worldwide. The dataset was pristine: completely normalized, with every data element fully populated, and all information timely and current. Excited by this treasure trove of quality data, unavailable elsewhere, several reputable physicians went to work on assessing the data, with their most notable conclusion that the drug hydroxychloroquine was not effective in treating COVID patients. The results, published in a reputable medical journal, caused the World Health Organization and several countries to suspend randomized controlled trials that had been set up to test the drug.

With more than a few medical researchers suspicious of this dataset that had emerged from nowhere, Surgisphere suddenly found itself under scrutiny. One newspaper discovered that the company, which had originally been founded to market textbooks, had a total of six employees, one of whom was a science fiction writer, and another of whom had previously worked as an actor in the adult film industry. Surgisphere claimed that non-disclosure agreements prohibited it from disclosing even the names of the 600 hospitals that had allegedly provided it with its data. The Surgisphere website has recently gone dark.

The simple lesson here is that if you rely on data products for anything important, it’s necessary to trust but verify. The less the data provider wants to tell you, the more questions you need to ask. Claiming non-disclosure prohibitions is an easy way to hide a host of sins. If you are relying on a data source for industry averages of any kind, at a minimum confirm the sample size. You should also assess whether the data producer has any conflicts of interest that could influence what data it collects or how it presents the results. The good news, of course, is that this is also an opportunity for reputable data producers to showcase their data quality.

When Bad Data Is Good Business

R Perkins — Fri, 05 Jun 2020 14:28:36 +0000

The New York Times recently ran a story that describes the inner workings of the online tenant screening business, where companies create background reports for landlords on prospective apartment renters. These are companies that access multiple public databases to aggregate data on a specific individual that the landlord can use to determine whether to rent to that person. The article is a scary take-down of a segment of the data business that decided to compete largely on price, and in the process threw quality out the window.

This is not a small segment of the data industry. Indeed, it is estimated that over 2,000 companies produce employment background and tenant screening reports, together bringing in over $3.2B annually. Companies in this segment range from a handful of giants to tiny mom-and-pop operators.

As the Times article notes, the tenant screening segment of the business is largely unregulated. In the tight market for rental apartments, landlords can afford to be picky and apartments rent quickly, so prospective renters typically will lose an apartment before they can get an erroneous report corrected. And with no central data source and lots of small data vendors, it’s impossible to get erroneous data corrected permanently.

The Times article pins the problem in large part on the widespread use of wildcard and Soundex name searches designed to search public databases exhaustively. And with lots of players and severe price pressure, most of the reports that are generated are fully automated. In most cases, landlords simply get a pile of whatever material results from these broad searches. In some cases, the data company provides a score or simply a yes/no recommendation to the landlord. Not surprisingly, landlords prefer these summaries to wading through and trying to assess lengthy source documents.
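To see why Soundex-style matching sweeps in so many unrelated people, here is a minimal sketch of the standard American Soundex coding in Python. The surnames are invented for illustration, and nothing here is meant to reflect how any particular screening company implements its searches.

```python
# Minimal sketch of American Soundex. The sample surnames are invented.
SOUNDEX_CODES = {
    letter: digit
    for letters, digit in (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    )
    for letter in letters
}


def soundex(name: str) -> str:
    """Return the four-character Soundex code for a name ("" if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def soundex_matches(applicant: str, records: list[str]) -> list[str]:
    """Return every record whose Soundex code equals the applicant's."""
    target = soundex(applicant)
    return [r for r in records if soundex(r) == target]


# Hypothetical surnames pulled from a public database.
records = ["Smyth", "Schmidt", "Sand", "Sneed", "Jones"]
hits = soundex_matches("Smith", records)
print(hits)
```

In this sketch, a search for "Smith" returns four of the five surnames, including "Sand" and "Sneed". With no human review, each of those would land in the applicant's report as a "hit."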

The core problem is that in this corner of the industry, we have the rare occurrence of unsophisticated data producers selling to unsophisticated data users. Initially, these data producers differentiated themselves by trying to tap the greatest number of data sources (terrorist databases, criminal databases, sex offender databases). This strategy tapped out pretty quickly, which is why these companies shifted to selling on price. To do this, they had to automate, meaning they began to sell reports based on broad searches with no human review. There are also a lot of data wholesalers in this business, meaning it is fast and relatively inexpensive to set yourself up as a background screening company.

There is also a more subtle aspect to this business that should interest all data producers. The use of broad wildcard searches is ostensibly done because “it’s better to produce a false positive than a false negative.” This sounds like the right approach on the surface, but hiding underneath is an understanding that the key dynamic of this business is a need to deliver “hits,” otherwise known as negative information. This is where the unsophisticated data user comes into play. Landlords evaluate and value their background screening providers based on how frequently they find negative information on an applicant. If landlords don’t see negative information regularly, they begin to question the value of the screening company, and become receptive to overtures from competitors who claim they do more rigorous screening. In other words, the more rigorous your data product, the more you are exposed competitively.

There’s a lesson here: if you create a data product whose purpose is to help users identify problems, you need to deliver problems frequently in order to succeed. This sets up a warped incentive where precision is the enemy of profit. Place this warped incentive in a market with strong downward price pressure, and the result is messy indeed.

Facebook Stores Come Up Empty

R Perkins — Thu, 28 May 2020 15:53:54 +0000

Facebook recently rolled out, to great fanfare, a new offering called Facebook Stores. In brief, it allows businesses with Facebook pages to add e-commerce functionality to those pages. It’s a free service to businesses, because Facebook hopes to profit mightily off transaction fees and additional advertising on its sites. Reportedly, over one million businesses have already signed on to use this new feature.

It’s a bit difficult to assess the significance of this new offering. This is inarguably a smart, if not particularly inspired, move by Facebook to cut itself (as all marketplaces and platforms dream of doing) into the revenue of the businesses on its site. But the value Facebook adds through Facebook Stores isn’t all that large. That’s because, despite its huge base of users, Facebook isn’t doing anything to drive these users to Facebook Stores. Anyone who makes a purchase through Facebook Stores is an existing customer or prospect of the store owner or has been driven there by paid advertising on Facebook. Nor does Facebook offer any classification structure or taxonomy to help its users discover businesses on Facebook. You really need to already know they are there. Indeed, businesses can’t even really use Facebook Stores in place of an e-commerce website because Facebook business pages provide only limited access to non-users of Facebook. In many ways, I feel about Facebook Stores the way I feel about Apple’s App Store: yes, it will make a lot of money, but imagine how much money it could have made had it been done right.

For those who are writing breathlessly about Facebook Stores as the dawn of “social commerce,” there is a theme. First, they say, forget about Facebook and focus on its sister company Instagram, where brands can promote new products and users can order them seamlessly. It sounds interesting, but when you pick it apart the same issues arise: you’re spending money with Instagram to build your audience and drive traffic, and then you give a percentage of your revenue to Instagram in exchange for capabilities you already have on your website.

In short, Facebook Stores is a smart move for Facebook. Is it a smart move for small businesses? I remain unconvinced. As the saying goes, “If your business depends on a platform, you don’t have a business.”

No Profit, No Problem

R Perkins — Fri, 15 May 2020 17:23:02 +0000

I’ve been asked a number of times if not-for-profit data producers have an inherent advantage over for-profit data producers who may be selling similar or identical data. In fact, there are a lot of non-profits founded, at least in part, to build databases and disseminate them.

As just a few examples, you may already be familiar with GuideStar (now called Candid, and a 2004 Infocommerce Model of Excellence award winner), which provides financial data on non-profits. It has other non-profit competitors such as Charity Navigator, ProPublica and even the Better Business Bureau. But this space also includes for-profit players such as Metasoft Systems. In the educational world, non-profit GreatSchools (a 2007 Infocommerce Model of Excellence award winner) competes with for-profit players such as U.S. News and Niche.com. The further you dig into the world of data, the more non-profit players you find, often competing directly with for-profit data providers.

So in head-to-head competitive match-ups, do non-profits have an advantage? In my experience, non-profits do have a number of advantages. Their primary one is in perception. Particularly when it comes to data collection, non-profits seem less threatening, they are viewed as neutral and independent, their need for data is rarely questioned, and many will supply data to non-profits because they feel they are helping out or supporting a cause. This warm and fuzzy perception extends to marketing and sales as well. Having seen it first-hand, I have no doubt that non-profits prefer doing business with other non-profits. There is a sense of shared purpose, and a belief that one non-profit won’t take advantage of another non-profit. Commercial data buyers won’t buy data from a non-profit simply because it is a non-profit, but non-profits will get full and equal consideration along with for-profit data vendors.

But that doesn’t mean it’s easy going for non-profit data providers. As mission-driven organizations, many give away their data, limiting their revenue opportunity to such things as API access and custom datasets. Also, while non-profits like doing business with other non-profits, in part that’s because they expect whatever they get from another non-profit will be heavily discounted if not free. Selling against for-profit competitors, the non-profit is often at a disadvantage because it can’t as easily invest in the newest technologies and third-party datasets, either because of resource constraints or because staying competitive begins to conflict with its own mission objectives.

On balance, I think that non-profit data producers do have a number of marketplace advantages, but these advantages are largely offset by marketplace expectations that non-profits must offer low-cost or free data, and competitive realities that make it hard to sell against a determined, for-profit competitor.

Fishing In the Data Lake

R Perkins — Fri, 08 May 2020 13:37:41 +0000

You have likely bumped into the hot new IT buzzword “data lake.” A data lake is simply a collection of data files, structured and unstructured, located in one place. This is in the eyes of some an advance over the “data warehouse,” where datasets are curated and highly organized. Fun fact: a bad data lake (one that holds too much useless data) is called a “data swamp.”

What’s the purpose of a data lake? Primarily, it’s to provide raw input to artificial intelligence and machine learning software. This new class of software is both powerful and complex, with the result that it has been bestowed with near-mystical qualities. As one senior executive of a successful manufacturing company told me, his company was aggressively adopting machine learning because “you just feed it the data and it gives you answers.” Yes, we now have software so powerful that it not only provides answers, but apparently formulates the questions as well.

The reality is much more mundane. This will not surprise any data publisher, but the more structure you provide to machine learning and artificial intelligence software, the better the results. That’s because while you can “feed” a bunch of disparate datasets into machine learning software, if there are no ready linkages between the datasets, your results will be, shall we say, suboptimal. And if the constituent data elements aren’t clean and normalized, you’ll get to see the axiom “garbage in, garbage out” playing out in real life.
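A small illustration of the linkage problem, sketched in Python. The two datasets, the company names and the list of legal suffixes are all invented for the example; real entity matching is considerably harder than this.

```python
import re

# Hypothetical legal suffixes to ignore when comparing company names.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation",
                  "co", "company", "llc", "ltd"}


def normalize_company(name: str) -> str:
    """Lowercase, strip punctuation and drop legal suffixes."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    words = [w for w in cleaned.split() if w not in LEGAL_SUFFIXES]
    return " ".join(words)


def link(left: dict, right: dict, key=lambda name: name) -> dict:
    """Join two name-keyed datasets on key(name); unmatched rows are dropped."""
    right_by_key = {key(name): value for name, value in right.items()}
    return {
        key(name): (value, right_by_key[key(name)])
        for name, value in left.items()
        if key(name) in right_by_key
    }


# Two invented datasets describing the same companies under different names.
revenue_musd = {"Acme Corp.": 12, "Widget Co., Inc.": 7}
headcount = {"ACME Corporation": 40, "Widget Company": 25}

raw_join = link(revenue_musd, headcount)                        # no shared keys
clean_join = link(revenue_musd, headcount, normalize_company)   # both rows link
print(raw_join)
print(clean_join)
```

Joined as-is, the two files have nothing in common. Only after the names are normalized does each company's revenue line up with its headcount, which is the kind of unglamorous preparation the software cannot be counted on to do for itself.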

It’s a sad reality that highly trained and highly paid data scientists still spend the majority of their time acting as what they call “data wranglers” and “data janitors,” trying to smooth out raw data enough that machine learning will deliver useful and dependable insights. In a timely response to this, software vendor C3-AI has just launched a COVID-19 data lake. Its claimed value is that, rather than just putting a collection of datasets in one place, C3-AI has taken the time to organize, unify and link the datasets.

The lesson here is that as data producers, we should never underestimate the value we create when we organize, normalize and clean data. Indeed, clean and organized data will be the foundation for the next wave of advances in both computing and human knowledge. Better data: better results.

Getting to the Top

R Perkins — Fri, 01 May 2020 12:41:37 +0000

It’s very gratifying to me to watch how quickly and successfully the data industry has evolved in its lead generation capabilities over the last two decades. We’ve moved from the legacy print directory model to highly sophisticated, multi-sourced signals and other inferential data to more precisely identify and pre-qualify sales leads. But where do we go from here?

I have long said that the path forward for data publishers is to move up the so-called value pyramid, from poorly differentiated “there’s a pony in there somewhere” lists that characterized the legacy print directory era to today’s evidence-based, high-confidence, highly targeted sales leads. The top of the value pyramid is actually making the sale on behalf of your customer, presumably in exchange for a sizable commission to justify the effort. Many data producers would be thrilled to shift from $100 sales leads to $10,000 commissions. But when you even scratch the surface of this idea, you see large obstacles, not the least of which is trying to scale a business model like this.

So what’s the next highest level of value? Pre-qualifying leads. In this model, the data producer takes the leads it is generating, and further qualifies them by making direct contact, and asking, for example, “are you actively in the market for a new CNC milling machine?” If the answer is in the affirmative, you have developed information of extremely high value. A number of companies that sell technology sales leads have been doing this for a while.

Alas, at the present time, this is a market-dependent idea. Technology marketing and sales teams tend to be highly sophisticated when it comes to lead management. But for most markets, as I’ve noted before, marketers are still out primarily looking for lists to load into automated marketing platforms, and most sales teams prefer to trust their instincts and sales prejudices over verified data, meaning great leads end up on the floor and the data producer is told its data wasn’t very good.

All this suggests to me that for data producers to move further up the value pyramid, a lot of market education is going to be required first, and that will take a lot of time and resources. We’ll get there, but not anytime soon.

The New Privacy Laws: Be Prepared

R Perkins — Fri, 17 Apr 2020 15:04:12 +0000

First came GDPR (General Data Protection Regulation). More recently came CCPA (California Consumer Privacy Act). According to experts, twelve more states are currently considering privacy laws of their own. Given the current political environment, there is little hope for a single federal privacy law. Short summary: it’s a big mess that is going to get even messier.

For data producers, it remains unclear what the impact of these laws might be. After all, most of us are B2B companies, and most of us hold relatively little information on individuals. Does that mean as an industry we are safe? It’s hard to say, as most of the emerging privacy regulations are oriented to consumer companies collecting individual data incidental to their primary business activity. To date, no privacy legislation has specifically addressed B2B companies that collect data as their primary business, so it’s unclear if we’ve slipped past the regulators entirely or whether we may end up as unintended roadkill.

An interesting set of short videos from law firm Baker McKenzie is well worth watching, if only to illustrate how far-reaching and potentially disruptive this new wave of legislation is likely to be.

It starts with employee data – are you prepared to show an employee, on request, all the information you maintain on that person? It moves into marketing, where the contents of your CRM system are likely to be open to inspection by any individual requesting it – are you ready to share call notes and other third-party data you’ve collected? It probably reaches into your datasets as well: the more information you collect on individuals, even if public source, the more you need to prepare. You will likely also need to start rethinking the terms under which you license data. In this new world, lawyers are suggesting that B2B companies not hold onto data any longer than needed, not a great piece of news in an industry where building deep historical data remains a big opportunity area. In short, in some way and form, every data publisher is likely to be impacted by the emerging data privacy legislation. Worst of all, many states are giving teeth to their privacy laws by allowing private lawsuits. Yes, anyone with a privacy beef will be able to haul you into court and seek damages.

We are moving into a brave new world in this area, and it’s going to be disruptive and painful. Like it or not, you’ve got to stay on top of developments here, because preparedness is the best form of protection.

You Will NEVER Replace This

R Perkins — Fri, 10 Apr 2020 18:24:29 +0000

Elon Musk is probably best known as the founder of Tesla. When Elon isn’t re-inventing the automobile, he’s running SpaceX, a company that builds and launches rockets and spacecraft. To keep busy, he also runs The Boring Company, which plans to tunnel highways under major cities to relieve traffic congestion (a company that also generated a reported $10 million selling flamethrowers to consumers – yes, you read that correctly!). On the more esoteric end of the scale, he also founded Neuralink, a company focused on developing brain-computer interfaces. Love him or hate him, you can’t deny he’s brilliantly innovative.

Many people know that Elon Musk got rich as one of the founders of PayPal. Far fewer know that his initial business success came as the creator of an online yellow pages company called Zip2 way back in 1996. Seeking to partner with print yellow pages publishers, he and his brother visited a top executive at the largest yellow pages publisher in Canada. After pitching their vision, the executive responded by picking up one of his thickest directories off his desk, throwing it at them and saying, “You ever think you’re going to replace this?”

Well, 25 years later, we know the answer to that one. Not only did the Internet replace the print yellow pages business, it largely destroyed the legacy yellow pages industry as well. Not surprisingly, the Musk brothers did well when Zip2 was ultimately sold for $300 million.

But what caused the death of the huge and fabulously profitable yellow pages industry? At the time, a lot of people (including me) thought the Internet would herald a new era of growth for the industry. The answer, in large part, was hubris.

Almost without exception, the big yellow pages publishers decided the fastest path to online riches was to take their regional products and go national. Overnight, these companies bought national business databases to roll out national yellow pages products. In doing so, they moved from having deep information on all the companies in their region, to having nothing more than name, address and telephone for all companies nationally. They vastly degraded the information value of their products in the belief that advertisers would flock to their doors. That’s critically important, because with yellow pages and buying guides, the advertising is the content.

That leads to the second miscalculation: these publishers all had regional rather than national salesforces. Good as these salespeople were, these publishers didn’t have the capability to sell nationally. This led to the third big miscalculation: the publishers all had regional brands and couldn’t come to grips with the fact that nobody had heard of them outside their regions. Without strong national brands, prospective advertisers yawned at these new national products that seemingly emerged out of nowhere.

Of course, the other big shift is that search engines got better. While still imperfect, in large part you now can find a plumber in your area with a simple search. And businesses flock to advertise on the search engines because with pay-per-click pricing, their advertising spend is now (at least in theory) more efficient.

The key take-away lessons for data publishers? First, a database that is a mile wide and an inch deep isn’t an effective product strategy these days. Far better to know a lot about a specific group than to know a little about everyone. Second, advertising-driven online data businesses are tougher than ever to pull off. Third, when you start believing your own press releases, things never end well. Fourth, when Elon Musk calls, listen before you throw something!

A Good Business in Bad Times

R Perkins — Fri, 27 Mar 2020 12:14:38 +0000

As I write this, the federal government has announced 3.3 million new unemployment claims – and this in just one week. In other words, this could likely represent just the tip of the iceberg. The human toll of the coronavirus is difficult to comprehend, with the toll on businesses not far behind.

In any sudden downturn, it has long been understood that some businesses will always get hit harder and faster than others. The rule of the game in any business downturn is to preserve cash in anticipation of reduced revenue. Consequently, expense reduction becomes the focus. Rightly or wrongly, most companies view advertising and marketing as something that can be suspended for some period of time with little consequence. Other business activities, such as company meetings and events, quickly get postponed. Every company reacts slightly differently, and often in uniquely arbitrary and sometimes ill-advised ways. But the goal is always the same: slow spending as much as possible to conserve cash.

With everyone trying to cut costs and slow payables at the same time, an adverse ripple effect is created that amplifies the pain. That’s why in widespread business downturns, few businesses are left truly unscathed by the resulting fallout.

Can one ever find safety from events of this magnitude? Probably not, but while few if any businesses will be totally untouched, some business models are clearly stronger than others.

The information business is inherently one of the stronger industries to be in right now. That’s not because information products are uniformly essential to their customers. We learned during the Great Recession that many information products believed to be “must-have” became “nice-to-have” almost overnight. But the B2B subscription model employed by most information and data publishers adds an important additional level of resiliency.

B2B subscriptions tend to be, in effect, annual or even multi-year contracts. Many are prepaid. Many are difficult to cancel during the contract term. This buys information and data publishers the most important protection of all: time. Time to ride out the storm, for conditions to improve or at least for calmer heads to prevail. Sure, new subscriptions will decline and renewal rates will drop during a downturn, but the bulk of the business will remain relatively safe.

In addition to being contractual and often prepaid, subscriptions to information products typically are not high visibility or so expensive that they capture the early attention of cost-cutters. And for data products in particular, they don’t sit idle during downturns like our current one, because they are just as useful to employees working from home.

Some data products have even a further level of protection because they are embedded into the workflow and systems of their customers. Simply put, it’s too slow, complicated and sometimes even risky to turn them off.

As I said earlier, there are no winners in a global pandemic. But the importance and value of data products, coupled with the strength of the dominant industry business model, will help this industry spring back quickly.

This pandemic is bigger than all of us. But if we all act responsibly, we can minimize severity and duration and get back to business sooner. Stay safe … and stay healthy. We’ll get through this if we all work together!

It’s Hard to Trust This One

R Perkins — Fri, 13 Mar 2020 13:44:48 +0000

Recently, ADP (the association of yellow page publishers, not the payroll company) announced something called “Trusted Local Directory,” an online directory of “Trusted Local Businesses.” To become a Trusted Local Business, a company is “thoroughly investigated” and if worthy receives both a Trusted Local Business seal for its use, along with a listing in the Trusted Local Directory.

I give ADP kudos for trying to find ways to breathe new life and relevance into the yellow page directory business, but I have to admit some skepticism as well.

First, this model is not a new one, and the track record of third-party trust evaluators isn’t a good one. Trust is hard. Perhaps more to the point, trust is expensive. And to a great extent, trust is in the eye of the beholder – simply defining how a company can objectively prove it is trustworthy is remarkably challenging. That’s why this is one tough model.

Consider as a case study the Better Business Bureau (BBB). They’ve been providing assurances of trust for over 100 years. But they’ve come to be viewed as a consumer advocacy organization when in fact they are supported by their business members, setting up all sorts of inherent conflicts. Moreover, new BBB business members automatically receive a top rating upon joining. The rating may then be reduced over time depending on how the business handles its complaints. That’s a loophole that scammers can drive a truck through. Moreover, BBB has set itself up to process and resolve mountains of consumer complaints, something it doesn’t get paid to do. More fundamentally, BBB has a pay-to-play business model. It makes no money unless a business becomes a member, and once a member the business automatically receives a top rating from BBB.

If BBB has trouble with this model, consider that ADP has the additional hurdle of being an unknown brand. Moreover, rather than leveraging the directories of its members, ADP has created the Trusted Local Directory as a new directory site that will need to build usage from scratch, a daunting task at this late date. And lest you think that the Trusted Local Directory is a directory of trusted local businesses, be advised that it appears to be a national directory of all businesses, one that offers no more than business name, address and phone.

An online directory of trusted local businesses could be a good and useful product. But the business model inherently fights you every step of the way. A directory like this needs a critical mass of businesses to be useful and viable. But assessing trust at anything more than a cursory level is slow, manual, expensive and difficult to scale. So you can’t do it for free. But by charging for inclusion, fewer businesses will want to be included. To combat this you can reduce your price, which means a less rigorous assessment, which in turn limits the value of the product. Alternately, you can give the impression of a rigorous review without actually doing the work, but that is more likely to lead to court than to success.

Crowdsourced reviews have come closest to making the third-party review model work. They are low cost and do readily scale, but many suffer from gaming and have credibility issues of their own. To succeed, they need a lot of policing and quality control, and that quickly gets complex and expensive, and there are only a few examples (TrustPilot is one good one) of meaningful monetization with this model.

Again, kudos to ADP for thinking outside the box, but it doesn’t seem to me they’ve cracked the code on this inherently challenging business model. And for anyone else considering this model, trust me, it’s hard.

Variable Pricing, Data-Style

R Perkins — Fri, 28 Feb 2020 14:36:22 +0000

Variable pricing is a well-known pricing strategy that changes the price for the same product or service based on factors such as time, date, sale location and level of demand. Implemented properly, variable pricing is a powerful tool to optimize revenue.

The downside to variable pricing is that it has a bad reputation. For example, when prices go up at times of peak demand (which often translates into times of peak need), that’s variable pricing. Generally speaking, when you notice variable pricing, it’s because you’re on the wrong end of the variance.

Variable pricing lends itself nicely to data products. But rather than thinking about a traditional variable pricing strategy, consider pricing based on intensity of usage.

Intensity of usage means tying the price of your data product to how intensely a customer uses it – the greater the use, the greater the price. Intensity pricing is not an attempt to support multiple prices for the same product, but rather an attempt to tie pricing to the value derived from use of the product, with intensity of usage a proxy for value derived from the product.

For data producers, intensity-based pricing can take many forms. Here are just a few examples to fuel your thinking:

1. Multi-user pricing. Yes, licensing multiple users and seats to large organizations is hardly a new idea. But it’s still a complex, mysterious thing to many data producers who shy away from it, leaving money on the table and probably encouraging widespread password sharing at the same time. The key to multi-user pricing is not to try and extract more from larger organizations simply because “they can afford it,” (a contentious and unsustainable approach), but to tie pricing to actual levels of usage as much as possible.

2. Modularize data product functionality. Not every user makes use of all your features and functionality. Think about identifying those usage patterns and then re-casting your data product into modules: the more modules you use, the more you pay. We all know the selling power of those grayed-out, extra cost items on the main dashboard!

3. Limit or meter exports. Many sales-oriented data products command high prices in part because of the contact information that they offer, such as email addresses. Unfortunately, many subscribers still view data products like these as glorified mailing lists to be used for giant email blasts. This is a high intensity use that should be priced at a premium. A growing number of data producers limit the number of records that can be downloaded in list format, charging a premium for additional records to reflect this high-intensity type of usage. It’s similarly possible to limit and then up-charge certain types of high-value reports and other results that provide value beyond the raw data itself.

4. Modularize the dataset. Just as few users will use all the features available to them in a data product, many will not use all the data made available to them. For example, it’s not uncommon for data producers to charge more for access to historical data because not everyone will use it, and those who do use it value it highly. Consider whether you have a similar opportunity to segment your dataset.
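To make the metered-export idea in item 3 concrete, here is a minimal sketch of intensity-based billing. The plan structure, the quota and every price in it are hypothetical numbers chosen for illustration, not figures from any real data product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExportPlan:
    """Hypothetical plan: a flat fee that includes a quota of exported records."""
    base_fee: float            # monthly subscription price
    included_records: int      # exports covered by the base fee
    overage_per_record: float  # premium charged for each record over quota


def monthly_charge(plan: ExportPlan, records_exported: int) -> float:
    """Return the month's charge: base fee plus a premium for exports over quota."""
    if records_exported < 0:
        raise ValueError("records_exported cannot be negative")
    extra = max(0, records_exported - plan.included_records)
    return round(plan.base_fee + extra * plan.overage_per_record, 2)


# Illustrative numbers only.
plan = ExportPlan(base_fee=500.0, included_records=1_000, overage_per_record=0.25)
light_user = monthly_charge(plan, 400)    # stays within quota, pays the base fee
heavy_user = monthly_charge(plan, 5_000)  # 4,000 records over quota
print(light_user, heavy_user)
```

The light user pays only the entry-level price, while the subscriber treating the product as a mailing list pays in proportion to that heavier use.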

While your first consideration should be revenue enhancement, also keep in mind that an intensity-based pricing approach helps protect your data from abuse, permits lower entry-level price points, creates up-sell opportunities, and properly positions your data as valuable and important.

There are competitive considerations as well. When you are selling an over-stuffed data product in order to justify a high price, the easiest strategy for a competitor is to build a slimmed-down version of your product at a much lower price – Disruption 101. You simply don’t want to be selling a prix fixe product in an increasingly a la carte world (look at the cable companies and their inability to sustain bundled pricing even with near-monopoly positions).

Good Data + Good Analytics = Good Business

R Perkins — Fri, 21 Feb 2020 18:00:58 +0000

Mere weeks ago, I made my predictions of what this decade would bring for the data industry. I said that while the decade we just left behind was largely about collecting and organizing data, the decade in front of us would be about putting these massive datasets to use. Machine learning and artificial intelligence are poised to make data even more powerful and thus valuable … provided of course the underlying data are properly structured and standardized.

Leave it to 2013 Infocommerce Model of Excellence winner Segmint to immediately show us what these predictions mean in practice through its recent acquisition of the Product and Service Taxonomy division of WAND Inc. WAND, by the way, is a 2004 Infocommerce Model of Excellence winner, making us especially proud to report this combination of capabilities.

Segmint is tackling a huge opportunity by helping banks better understand their customers for marketing and other purposes. Banks capture tremendous amounts of transactional activity, much of it in real-time. The banking industry has also invested billions of dollars in building data warehouses to store this information. So far, so good. But if you want to derive insights from all these data, you have to be able to confidently roll it up to get summary data. And that’s where banks came up short. You can’t assess customer spending on home furnishings unless you can identify credit card merchants who sell home furnishings. That’s where Segmint and WAND come in. How many ways can people abbreviate and misspell the name “Home Depot”? Multiply that by billions of transactions and millions of companies, and you start to get the idea of both the problem and the opportunity.
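The roll-up problem described above can be sketched in a few lines. This is only an illustration of the general technique (clean the raw merchant string, fuzzy-match it to a canonical name, then total spending by category) using Python's standard `difflib`; it is not Segmint's or WAND's actual method, and the merchant list, category labels and the 0.7 cutoff are all assumptions.

```python
import re
from collections import defaultdict
from difflib import get_close_matches

# Hypothetical reference list; a real taxonomy would cover millions of merchants.
MERCHANT_CATEGORIES = {
    "HOME DEPOT": "home improvement",
    "IKEA": "home furnishings",
    "STARBUCKS": "coffee shops",
}


def clean(raw: str) -> str:
    """Uppercase, replace digits and punctuation with spaces, collapse whitespace."""
    letters_only = re.sub(r"[^A-Z ]+", " ", raw.upper())
    return " ".join(letters_only.split())


def resolve_merchant(raw: str, cutoff: float = 0.7):
    """Return the closest canonical merchant name, or None if nothing is close."""
    matches = get_close_matches(clean(raw), list(MERCHANT_CATEGORIES), n=1, cutoff=cutoff)
    return matches[0] if matches else None


def spend_by_category(transactions):
    """Total (raw_merchant_name, amount) pairs by category."""
    totals = defaultdict(float)
    for raw_name, amount in transactions:
        merchant = resolve_merchant(raw_name)
        totals[MERCHANT_CATEGORIES.get(merchant, "unclassified")] += amount
    return dict(totals)


sample = [
    ("THE HOME DEPOT #4512", 120.0),
    ("HOME DEPT 0931", 30.0),
    ("SHELL OIL 5521", 45.0),
]
print(spend_by_category(sample))
```

Anything that fails to match lands in "unclassified", which is where the slow manual clean-up work begins.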

When WAND is done cleaning and standardizing the data, Segmint goes to work with its proprietary segmentation and predictive analytics tools. Segmint helps bank marketers understand the lifestyle characteristics of their customers and target them with appropriate messages both to aid retention and sell new products. These segments are continuously updated via real-time feeds from its bank customers (all fully anonymized). With that level of high quality, real-time and granular data, Segmint can readily move from profiling customers to predicting their needs and interests.

Simply put: this is the future of the data business. It starts with the clean-up work nobody else wants to do (and it’s why data scientists spend more time cleaning data than analyzing it) and then uses advanced software to find actionable, profitable insights from the patterns in that data. This is the magic of the data business that will be realized in this new decade. And we couldn’t be prouder that two Infocommerce Model of Excellence winners are leading the way … together. Congrats to both!

Google: Now Organizing the World’s Data

R Perkins — Fri, 14 Feb 2020 13:24:16 +0000

Google's mission is “to organize the world's information and make it universally accessible and useful”. How does its newest foray into data stack up? Early this year, Google officially launched something it calls Dataset Search. It’s been in public beta since 2018 (I still contend that the concept of “public beta” remains Google’s single greatest technological innovation), but now it’s for real and according to Google, already contains information on over 25 million datasets.

Dataset Search is loosely tied to Google Scholar, a specialized version of the Google search engine intended to make it easier to search for academic papers. Along those lines, Google sees Dataset Search as something most useful to scholars and data journalists.

Improving discovery of datasets is a worthy and important task. Quite likely, 25 million datasets are only a tiny fraction of what exists online. And in this age of open data, Google is tackling a big task at just the right time.

Anyone can add a database to Dataset Search. Include some metatags on the relevant webpage, and the Google crawler will find it, and automatically inject a record into Dataset Search. Is it worth the effort? Well, it’s free and it’s fairly easy to participate, and it’s Google. Google does note that information in Dataset Search is added to the Google Knowledge Graph, meaning it connects Dataset Search records to all other information it knows about the organization that owns the dataset. Some suspect this may improve your overall Google search ranking, though Google is necessarily playing coy on this point.
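In practice, the markup Google documents for Dataset Search is schema.org `Dataset` structured data, commonly embedded in the page as JSON-LD. The sketch below generates such a tag; the helper name, the sample values and the choice of properties beyond `name` and `description` are assumptions, so check Google's current structured data guidelines before relying on it.

```python
import json


def dataset_jsonld_tag(name: str, description: str, url: str, publisher: str) -> str:
    """Build a JSON-LD <script> tag describing one dataset (hypothetical helper)."""
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "creator": {"@type": "Organization", "name": publisher},
    }
    payload = json.dumps(record, indent=2)
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder values for illustration only.
print(dataset_jsonld_tag(
    name="Example Nonprofit Financials",
    description="Hypothetical annual financial data on nonprofit organizations.",
    url="https://example.com/datasets/nonprofit-financials",
    publisher="Example Data Co",
))
```

Emitting one tag on the dataset's landing page, rather than one on every record page, describes the dataset once instead of thousands of times.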

What’s in Dataset Search today? I have to say, while it has potential, it’s going through some growing pains. Pro Publica has a very good database of financial data on non-profit organizations. However, rather than list the dataset once, Pro Publica appears to have coded its database so all 800,000 records in its database have become separate records in Dataset Search. Humorously, for some other organizations, a CEO headshot will be displayed instead of a company logo. This will all be corrected in time. My biggest disappointment, however, is likely to remain: Dataset Search is a database of databases searchable primarily by full text queries. There are very few parameters that can be applied to usefully narrow a search, so much like the primary Google search engine itself, you will still have to manually browse through endless search results to find what you want.

I do want to stress that Dataset Search is open to commercial data products. It’s an easy, free way to get some additional online exposure for your products and if it bumps up your search result rankings, it’s well worth the effort. And as Dataset Search evolves, it may well become an accepted way to discover and source commercial data products. Why not get in on the ground floor?

What Facebook Knows and Doesn’t Know

R Perkins — Fri, 31 Jan 2020 16:47:45 +0000

Privacy concerns have been in the forefront of the news lately, and no article discussing privacy is complete without mentioning Facebook. That’s because Facebook is considered to be the all-knowing machine that’s tirelessly collecting data about us and turning it into insights that can be used to better market things to us with extreme precision. Certainly Facebook isn’t the only online juggernaut with this strategy and sophisticated data collection capabilities, but in many ways it’s the poster child for our collective concerns and anxieties.

I joined Facebook in 2007. At the time, it was becoming the next big thing, and I wanted to see what it was all about. After some initial excitement, I noticed my usage dropping as the years went by. My usage massively dropped in 2019 when I somehow changed my default language settings to German and didn’t feel any real urgency to figure out how to undo it. All this to say, I am certainly not a typical Facebook user.

While not a high intensity Facebook user, I am a high intensity data nerd, so when I read an article that explained how to peek under the hood to see in detail what Facebook knows about you, and what it has learned about me from third parties, I of course could not resist. If your interest is equally high, start your journey here: https://www.facebook.com/off_facebook_activity/

I clicked all the options so that I could see everything Facebook knew about me. While not a heavy user, I was a long-term user, and I imagined Facebook had likely learned a lot about me in 13 years. In due course, Facebook presented me with a downloadable Zip file that contained a number of folders.

The folder “Ads and Businesses” turned out to be the money folder. This is where I learned my personal interests as divined by Facebook – all individual categories that can be selected by marketers. Here are some highlights of my interests:

Cast iron (who doesn’t love cast iron?)
Scratching (what can I say?)
Tesla (Facebook helpfully clarified that my interest was not in the car, but rather the band … the band?)
Oysters (I don’t eat them)
Skiing (I don’t ski)
Star Trek (absolutely true – when I was about 14 years old)

There were about 50 interest categories in all; not all wrong, but overall far from an accurate picture. What I infer by looking at these interest categories is that they are keywords crudely extracted from various ads I had clicked on over the years. I say “crudely” because these interest tags don’t represent an organized taxonomy; there is no hierarchy, and there is only a lackluster attempt to disambiguate. For example, one of my interests is “online.” Without any context, this is useless information. And if Facebook assesses the recency of my interests, or the intensity of my interest (how many times, for example, did I look at things relating to cast iron?), it is not sharing these data with its users.

If Facebook underwhelmed me with its insights into my interests, the listing of “Advertisers who uploaded a contact list with my information” totally confused me. I was presented with a list of literally hundreds of businesses that ostensibly had my contact information and had tried to match it to my Facebook data. What I saw on this list were probably close to a hundred local car dealerships from all over the country, followed by almost as many local real estate agencies. I feel certain, for example, that I have never visited the website of, much less interacted with, International Honda of Sheboygan, WI. But this car dealership – reportedly – has my contact information and is matching it to Facebook.

There are a few possible explanations for this. The one I find most likely is that in the case of automobiles, some unscrupulous middlemen are selling the same file of “leads” to unsuspecting car dealers nationwide. It could also be inexperienced or bad marketers or marketing agencies. Some free advice to Toledo Edison, Maybelline, The Property Girls of Michigan, Bank Midwest and Choctaw Casinos and Resorts – take a look at your list sources and maybe even your marketing strategies, because something seems broken.

Looking at your own Facebook data gives you a rare opportunity to see and evaluate what’s going on behind the curtain. To me, Facebook’s secret sauce really doesn’t appear to be its technology. Grabbing keywords from ads I have clicked is utterly banal. Offering marketers hundreds of thousands of interest tags does in fact allow for extreme microtargeting, but in the sloppiest, laziest possible way. Capturing all my ad clicks is useful and valuable, but hardly cutting edge. What appears to make Facebook so valuable seems not to be the data it has collected, but the fact it has collected data on a hitherto unknown scale. Knowing that I have an interest in flax (yes, this is really one of my reputed interests!) even if true is pretty useless until you get enough scale to identify thousands of people interested in flax, at which point this obscure data point suddenly acquires monetary value.

What my Facebook data suggest is that while it may not be good enough to deliver the precision and accuracy many marketers have bought into, what it has done is create “good enough” data at extreme scale. And that is proving to be even better than good enough.

Email: Valuable However You Look at It

R Perkins — Thu, 23 Jan 2020 21:05:05 +0000

The Internet has evolved dramatically in the last 25 years. But one aspect of online interaction has remained largely untouched. I am talking about the humble email address.

Despite the growth of web-based phone and video and the dominance of social media, the importance of the email address has actually increased. Indeed, the primary way to log into these other burgeoning communications channels is most commonly to use your email address as your username. After all these years, it’s easy to take it for granted. But from a data perspective, it’s worth taking a few minutes to explore some of its hidden value.

First, an email address is a unique identifier. Yes, many people have both a personal and business email address, but those email addresses are role-based, so they generally remain unique within a given role. Moreover, many data aggregators have been busy building databases of individuals and all their known email addresses, making it easy to resolve multiple email addresses back to a single individual.

Unique identifiers of all kinds are extraordinarily important because they provide disambiguation. That means that you can confidently match datasets based on email address because no two people can have the same email at the same time.
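As a minimal sketch of that kind of matching, the snippet below joins two record sets on a normalized email key. The field names and sample records are hypothetical, and lowercasing the whole address is a simplifying assumption that holds for most mail systems in practice.

```python
def normalize_email(address: str) -> str:
    """Trim whitespace and lowercase so cosmetic differences don't block a match."""
    return address.strip().lower()


def match_on_email(left, right):
    """Inner-join two lists of dict records on their normalized 'email' field."""
    index = {normalize_email(r["email"]): r for r in right}
    merged = []
    for record in left:
        key = normalize_email(record["email"])
        if key in index:
            # Combine both records and keep the normalized address as the key.
            merged.append({**index[key], **record, "email": key})
    return merged


# Hypothetical sample records.
subscribers = [
    {"email": "JSmith@Pfizer.com ", "plan": "pro"},
    {"email": "arty1745@gmail.com", "plan": "free"},
]
crm_contacts = [{"email": "jsmith@pfizer.com", "title": "Director"}]
print(match_on_email(subscribers, crm_contacts))
```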

But email addresses aren’t just unique identifiers, they are persistent unique identifiers. That means that people don’t change them often or on a whim. Further, unlike telephone numbers, email addresses tend not to be re-issued. That’s because businesses work hard to avoid re-issuing email addresses and as to personal emails, they are typically very cheap to keep and a big hassle to change, resulting in a lot of stability.

Let’s go a step further: email addresses are persistent, intelligent unique identifiers. At least for business use, email addresses are not only tied to a particular company, the company name is embedded in the email address. And again, data aggregators have been hard at work mapping domain names to detailed information on the companies behind them. That’s why an increasing number of B2B companies actually prohibit people from signing up for such things as a newsletter using a personal email address. A personal email address (e.g. arty1745@gmail.com) tells them little; a business email address (e.g. jsmith@pfizer.com) readily yields a wealth of company demographics with which to both target promotions and build audience profiles. Indeed, even the structure of email addresses has been monetized. There are, for example, companies that will append inferred email addresses based on the naming convention used by a specific company (e.g. first initial and last name). It’s also interesting that the top level domain can tell you the nature of the organization (e.g. “.org”), the country where it is operating (e.g. “co.uk”), and even the nature of its business (e.g. “.edu”).
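Both ideas in this paragraph, reading the company out of the address and inferring an address from a naming convention, are simple to sketch. The list of personal mail domains below is a small illustrative sample, the function names are hypothetical, and an inferred address is only a guess until it has been verified.

```python
# Illustrative sample only; a production list would be far longer.
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com"}


def email_domain(address: str) -> str:
    """Return the lowercased part of the address after the last '@'."""
    return address.strip().lower().rpartition("@")[2]


def is_business_email(address: str) -> bool:
    """Treat any domain not on the personal-mail list as a business domain."""
    return email_domain(address) not in PERSONAL_DOMAINS


def infer_email(first: str, last: str, domain: str, pattern: str = "{f}{last}") -> str:
    """Guess an address from a naming convention.

    Placeholders: {f} = first initial, {first} = first name, {last} = last name.
    The default pattern is the first-initial-plus-last-name convention.
    """
    local = pattern.format(f=first[0], first=first, last=last).lower()
    return f"{local}@{domain.lower()}"


print(email_domain("jsmith@pfizer.com"))            # company domain to look up
print(is_business_email("arty1745@gmail.com"))      # personal address
print(infer_email("Jane", "Smith", "example.com"))  # unverified guess
```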

The unique format of the email address also adds to its value. While the length of an email address will vary, the overall pattern with the distinctive @ sign makes it easy to harvest and extract. It also makes it possible to link text documents (perhaps academic papers that include the email address of the author) to data records.
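Here is a rough sketch of how that harvesting works. The regular expression is a deliberately simplified pattern that catches common addresses; it is an assumption for illustration, not a full implementation of the email address standard.

```python
import re

# Simplified pattern: local part, '@', then a dotted domain ending in letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text: str) -> list:
    """Return the unique addresses found in text, lowercased, in order of appearance."""
    seen = {}
    for match in EMAIL_PATTERN.findall(text):
        seen.setdefault(match.lower(), None)
    return list(seen)


sample_text = (
    "Corresponding author: J. Smith (jsmith@pfizer.com). "
    "Contact JSmith@Pfizer.com or arty1745@gmail.com."
)
print(extract_emails(sample_text))
```

Each extracted address can then serve as the key that links the document to an existing data record.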

Sure, email addresses have real value because they can put marketing messages under the noses of prospects, but to a data publisher, email addresses are worth a whole lot more.

When Data Is Smarter Than Its Users

R Perkins — Fri, 17 Jan 2020 16:20:22 +0000

In my review of the decade past and my predictions for our new decade, the common thread is that the quality of commercial data products has advanced immeasurably, as has their insight and predictive capability. As an industry, we’ve accomplished some truly remarkable things in the past ten years by making data more powerful, more useful and more current.

This said, data buyers remain far less sophisticated than the datasets they are buying. While buyers of data used for research and planning purposes seem to both appreciate and use powerful new data capabilities, marketers – generally speaking – do not. Even worse, this problem is ages-old.

Earlier in my career, I spent several years in the direct marketing business. Even back in the 1980s we were doing file overlays, assessing purchase propensity and building out detailed prospect profiles based on hundreds of individual data elements. It was slower and sloppier and harder back then, but we were doing it. We even had artificial intelligence software, though one project in particular I recall involving a million customer records required that we rent exclusive use of a mainframe computer for two weeks! And not only did we have the capability, we had the buy-in of the marketing world. There was a fever pitch of interest in the incredible potential of super-targeted marketing.

But what we quickly learned as mailing list providers was that while sales and marketing types talked quality, what they bought was quantity. If you went to any organization of any size and said, “we have identified the 5,000 absolute best prospects in the country for you, all ready, willing and able to buy,” you would get interest but few if any takers. At best, you’d have marketers say that they’d throw these prospects in the pot with all the others – as long as they weren’t too expensive.

From this experience came my epiphany: marketers had no experience with high quality prospects. They were so used to crappy data they had built processes and organizations optimized to churn through vast quantities of poor quality prospects. As to our 5,000 perfect prospects, we heard things like, “we’d chew through them in a week.” Note the operative word “chew.”

We have new and better buzzwords now, but the broad problem is the same. Nowadays, when it comes to sales leads, companies are literally feeding the beast in the form of their marketing automation platforms. And everything has to flow through the platform because otherwise reports would be inaccurate and KPIs would be wrong.

Companies today will pay handsomely for qualified sales leads – sometimes up to several hundred dollars per lead. But these top quality leads won’t get treated any better than the mediocre ones. How do I know? Because the marketers spending all these big bucks will insist the leads be formatted for easy loading into their marketing platforms, and I’ve also been told, “we’re not interested unless you can guarantee at least 100 leads per week.” And that’s how far we have progressed in 30 years: marketers have solved the tension between quality and quantity by simply insisting on both. And the pressure to deliver both will necessarily come at the expense of quality. This essential disconnect won’t be solved easily, but when it is, a new golden age of data will arrive.

Looking Ahead: The Application Decade

R Perkins — Fri, 10 Jan 2020 17:14:24 +0000

As I noted in my previous post, the data business was the right place to be in the last decade. Commercial data producers were already well-positioned in 2010. The value of data products was already well understood. The quaint subscription model that data producers had been stubbornly clinging to for years suddenly became all the rage. The birth of Big Data and the growth of data science as a profession put a spotlight on the need for high quality datasets.

From 2010-2019, things only got better as Big Data tools proliferated, the cloud offered cheap, efficient storage, computer processing power continued to increase and we were finally able to build and make effective use of truly massive, multi-sourced databases, many updated in real time.

The advances we have made as an industry in the last ten years have been truly breathtaking. But if the last decade was characterized by a wondrous growth in the accumulation of data, the decade in front of us will be about the smart application of that data.

A picture of what’s in store for us is already emerging. Artificial intelligence will take the data industry to the next evolutionary plane by enabling us to predict buyers and sellers and other transactional activity with confidence and in advance. That’s no small statement when you consider that the vast majority of commercial data products exist to bring buyers and sellers together or otherwise enable business transactions.

Our new decade will also be notable for its embrace of data governance. There simply won’t be any place for poorly managed and sloppily maintained datasets. Those who properly see data governance as an opportunity and not a burden will prosper mightily. And yes, the commercial data business will yield a first-mover advantage, because we understood the power of data governance even before it had a name.

Boil it all down, and my prediction is that we will be entering the decade of data-driven predictions. By 2030, commercial data producers will literally be able to predict the future, at least from a sales and marketing enablement perspective. The new tools required already exist, and they will continue to improve. All that’s needed is the creativity to apply them to the oldest, most basic objective of business: buying and selling. And our industry is nothing if not creative!

Looking Back: The Data Decade

R Perkins — Fri, 03 Jan 2020 17:10:00 +0000

In so many respects, the last ten years can be fairly called the Data Decade. In large part, that’s because the data business came into the last decade on a strong footing. While the ad-based media world was decimated by the likes of Google and Facebook, data companies held firm to their subscription-based revenue models and thrived as a result. And while legacy print publishers struggled to make online work for them, data publishers moved online without issues or complications, in large part because their products were inherently more useful when accessed online. As importantly, data entered the last decade with a lot of buzz, because the value and power of data products had become broadly understood.

At the highest level, data got both bigger and better in the last decade. The much-used and much-abused term “Big Data” came into popular usage. While Big Data was misunderstood by many, the impact for data publishers is that for the first time we became able to both aggregate and productively access and use truly massive amounts of data, creating endless new opportunities for both new and enhanced data products.

While life without the cloud is unimaginable today, at the beginning of the last decade it was just getting started and its importance was vastly underappreciated. But the cloud profoundly altered for the better both the cost and convenience of maintaining and manipulating large amounts of data.

I’d argue too that APIs came into their own in the last decade to become a necessary component of almost every online data business. The result of this is that data became more portable and easier to aggregate and mix and match and integrate in ways that generated lots of new revenue for data owners while also building powerful lock-in with data licensees who increasingly became reliant on these data feeds. That’s one of the reasons that the data business didn’t feel the impact of the Great Recession as severely as many others.

Through a combination of Big Data, the cloud and APIs, the last decade saw incredible growth in collection and use of behavioral signals to infer such critical things as purchase interest and intent, opening both new markets and new revenue opportunities. This of course allowed many data publishers to tap into the many household name marketing automation platforms. Hopefully, companies will someday develop marketing campaigns as sophisticated as the data powering them, as the holy grail of fewer but more effective email messages still seems badly out of reach.

Another fascinating development of the last decade is the growing understanding of the power and value of data. The cutesy term “data exhaust” came into common usage in the last few years, referring to data created as a by-product of some other activity. And just as start-ups once rushed to add social media elements to their products, however inappropriately, venture capitalists now rarely see a business plan without a reference to a start-up’s data opportunity. There will be backlash here as both entrepreneurs and venture capitalists learn the expensive lesson that “not all data is good data,” but in the meantime, the goldrush continues unabated.

Somewhat related to this trend, we’ve seen much interest and activity around the concept of “data governance,” which is an acknowledgement that while poor quality data is close to useless, top quality data is enormously powerful in large part because it can be trusted implicitly. Indeed, if you listen in at any gathering of data scientists, the grousing you will hear is that they see themselves in fact as “data janitors,” reflecting the fact that they spend far more of their time cleaning and structuring data than actually analyzing it.

I can’t close out this decade without also mentioning the trend towards open data, which in large part refers to the increasing availability of public sector databases that often can be used to enhance commercial data products.

In all, it was a very good decade for the data business, a happy outcome that resulted primarily from the increased technical ability to aggregate and process huge amounts of data, growing willingness to share data on a computer-to-computer basis, and much greater attention to improving the overall quality of data.

And the decade now in front of us? Next week, I'll take a look ahead.

Is Time Up for 230?

R Perkins — Fri, 09 Aug 2019 13:43:54 +0000

In 1996, several Internet lifetimes ago, Congress passed a bill called the Communications Decency Act (officially, it is Title V of the Telecommunications Act of 1996). The law was a somewhat ham-handed attempt at prohibiting the posting of indecent material online (big chunks of the law were ultimately ruled unconstitutional by the Supreme Court). But one of the sections of the law that remained in force was Section 230. In many ways, Section 230 is the basis for the modern Internet.

Section 230 is short – only 26 words – but those 26 words are so important there has even been an entire book written about their implications. Section 230 says the following:

“No provider or user of an interactive computer service shall be treated as the publisher or speaker of any information provided by another information content provider.”

The impetus for Section 230 was a string of court decisions where the owners of websites were being held liable for things posted by users of their websites. Section 230 stepped in to provide near-absolute immunity for website owners. Think of it as the “don’t shoot the messenger” defense. Without Section 230, websites like Facebook, Twitter and YouTube probably wouldn’t exist. And most newspapers and online publications probably wouldn’t let users post comments. Without Section 230, the Internet would look very different. Some might argue we’d be better off without it. But the protections of Section 230 extend to many information companies as well.

That’s because Section 230 also provides strong legal protection for online ratings and reviews. Without Section 230, sites as varied as Yelp, TripAdvisor and even Wikipedia might find it difficult to operate. Indeed, all crowdsourced data sites would instantly become very risky to operate.

The reason that Section 230 is in the news right now is that it also provides strong protection to sites that traffic in hateful and violent speech. That’s why there are moves afoot to change or even repeal Section 230. Some of these actions are well intentioned. Others are blatantly political. But regardless of intent, these are actions that publishers need to watch, because if it becomes too risky to publish third-party content, the unintended consequences will be huge indeed.

Use Your Computer Vision

R Perkins — Fri, 12 Jul 2019 19:10:14 +0000

Those familiar with the powerhouse real estate listing site Zillow will likely recall that it burst on the scene in 2006 with an irresistible new offering: a free online estimate of the value of every house in the United States. Zillow calls them Zestimates. The site crashed continuously from too much traffic when it first launched, and Zillow now gets a stunning 195 million unique visitors monthly, all with virtually no advertising. Credit the Zestimates for this.

As you would expect, Zestimates are derived algorithmically, using a combination of public domain and recent sales data. The algorithm selects recent sales of comparable nearby houses to compute estimated value.
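To illustrate the general comparable-sales idea (and only the general idea, since Zillow's actual algorithm is proprietary and far more elaborate), here is a toy estimator. The field names, the one-mile radius and the 20 percent size tolerance are all assumptions.

```python
from statistics import median


def estimate_value(subject_sqft, recent_sales, max_miles=1.0, sqft_tolerance=0.2):
    """Estimate a home's value from the median price per square foot of its comps.

    recent_sales is a list of dicts with 'price', 'sqft' and 'miles_away' keys.
    A sale counts as comparable if it is nearby and similar in size.
    """
    comps = [
        sale for sale in recent_sales
        if sale["miles_away"] <= max_miles
        and abs(sale["sqft"] - subject_sqft) <= sqft_tolerance * subject_sqft
    ]
    if not comps:
        return None  # no comparable sales, so no estimate
    price_per_sqft = median(sale["price"] / sale["sqft"] for sale in comps)
    return round(price_per_sqft * subject_sqft)


# Hypothetical recent sales.
sales = [
    {"price": 400_000, "sqft": 2_000, "miles_away": 0.3},
    {"price": 378_000, "sqft": 1_800, "miles_away": 0.5},
    {"price": 462_000, "sqft": 2_200, "miles_away": 0.8},
    {"price": 900_000, "sqft": 4_000, "miles_away": 0.2},  # too large to compare
    {"price": 300_000, "sqft": 2_000, "miles_away": 5.0},  # too far away
]
print(estimate_value(2_000, sales))
```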

As you would also expect, professional appraisers hate Zestimates. They believe they produce more accurate valuations because they hand-select the comparable nearby homes. However, in the interest of consistent appraisals, the hand-selection process that appraisers use is so prescribed and formulaic that it operates much like an algorithm. At this level, you could argue that appraisers have little advantage over the computed Zestimate.

However, one area in which appraisers have a distinct advantage is that they are able to assess the condition and interiors of the properties they are appraising. They visually inspect the home and can use interior photos of comparable homes that have recently sold to refine their estimates.

Not to be outdone, Zillow is employing artificial intelligence to create what it calls “computer vision.” Using interior and exterior photos of millions of recently sold homes, Zillow now assesses such things as curb appeal, construction quality and even landscaping; quantifies what it finds; and factors that information into its valuation algorithm. When it has interior photos of a house, it scans for such things as granite countertops, upgraded bathrooms and even how much natural light the house enjoys, and incorporates this information into its algorithm as well.

With this advance, it looks very much as if the appraisers’ remaining competitive advantage is owning “the last mile,” because they are the feet on the street that actually visit the house being appraised. But you can see where things are heading: as companies like Zillow refine their technology, the day may well come when an appraisal is performed by the homeowner uploading interior pictures of her house, and perhaps confirming public record data, such as the number of rooms in the house.

There are many market verticals where automated inspection and interpretation of visual data can be used. While the technology is in its infancy, its power is undeniable, so it’s not too early to think about possible ways it might enhance your data products.