albertsun.info

Notes from Three Weeks in Asia

Albert Sun — Sun, 12 Aug 2012 15:22:13 -0400

I just spent the past three weeks in Asia, visiting Tokyo, several cities in mainland China (Shanghai, Wuxi, Wuhu, Nanjing) and then Hong Kong. This was the first time I'd been in China since 2007. My cousin had invited the whole family to Nanjing for her wedding, and my brother and I decided to tack on a few more destinations. China is changing very rapidly and it was my first time in Tokyo. A few things I noticed:

On trains and public infrastructure

The subways and trains across Asia were fantastic and by far the easiest way to get around. The subways all had cell service, TVs and multilingual announcements of all stops. Many stations had doors built onto the platform and air-conditioned platforms. Traveling from Nanjing to Shanghai on high-speed rail (about the same distance as New York to Boston) took an hour and a half from city center to city center.

On bikeshare

Bikeshare was everywhere. Tokyo had it in many places. Shanghai, Nanjing and Wuxi all had it. Wuhu is planning it. Most didn't seem to require any sort of advance sign up, just pay at a vending machine and ride away. In Tokyo I even saw what looked like a Vespa share station.

On restaurants in China

Several restaurants I went to in China have adopted a curious habit of charging extra for napkins and even for plates and dinnerware. Each setting at the table would have a sealed wet napkin and the plate, cup and bowl sealed in shrink wrapped plastic. Opening the wet napkin cost 1 RMB and the plates 2 RMB. The restaurants have outsourced dishwashing to companies that specialize and return dishes shrink wrapped and disinfected and are passing the cost on to customers. Curiously though, every restaurant allowed you to bring in outside beverages, alcoholic and otherwise.

On health and food safety

In China, my relatives were very concerned about the safety of food. Any fresh fruit was peeled, including grapes and peaches. In our luggage, we brought several bags of powdered milk and baby formula for people who had requested it.

On smog

While in mainland China we experienced a string of beautiful hot blue sky days with no smog, rain or clouds. Near Wuhu many factories were shut down due to high temperature. Hong Kong was smoggy while we were there though. The South China Morning Post seemed to be leading a call for warning the public about the dangers of smog.

On shopping malls

There were an absurd number of shopping malls and department stores. And yet they somehow all seemed full of people shopping. The malls are mostly built up vertically, and the basements have giant food courts. In Japan in particular, the department store basements have "depachika", glorious food markets with stall after stall of varied food and pastries.

On suburbs, gated communities and cars

China's cities are expanding and the streets everywhere are clogged with cars, but they haven't reached U.S. levels of sprawl yet. New construction in the outlying areas of cities consists of apartment complexes with multiple high- or low-rise buildings, parking spaces, green space and maintenance buildings. They look like New York's Stuyvesant Town but on a smaller scale. There's a real bias against buying a "used" apartment; people generally prefer new construction. Many newly constructed buildings provide the apartments unfurnished, with no fixtures, appliances, floor boards, etc., to allow the new purchaser to install those to their own liking.

On banquets, table seating and drinking

When we went out to eat as a large group in China, we rarely sat in the main dining room of a restaurant. Most restaurants have a large number of private dining rooms of different sizes to accommodate the group. The dining rooms all have a large circular dining table with a lazy susan to hold the food. Even though the table is round, seating matters. The guest of honor sits at the seat furthest from the door and facing it. There is typically some sort of aesthetically pleasing backdrop behind them. That end is the "top" of the table. In the opposite seat, at the "bottom" of the table and with their back to the door, is the person who has invited everyone to dinner, and who will confer with the waiters to order and will pay at the end. For the rest of the seats, people sit from the top of the table to the bottom according to their position.

On Chinese tourists and tour groups

Tourism in China is booming. My aunt owns a travel agency in Wuhu and told me she has dozens of new competitors. Tour groups from China dominated many of Tokyo's tourist attractions with their matching hats and flag waving guides. Within China, tour groups were even more common. In fact, people don't seem to travel on vacation in any way other than a tour group. These tours are organized and paid for by offices for all their employees to go on vacation together, bringing along spouses and children.

On global cultural convergence

Wuhu is a fairly small city by Chinese standards, in a part of the country analogous to the Midwest. In the center of town there is a large pedestrian shopping area, and at night at the center of that shopping area there were high school age kids with a boombox blasting LMFAO's Party Rock Anthem and shuffling. Shuffling pretty darn well too.

Jury Duty in New York City

Albert Sun — Sat, 26 May 2012 10:50:20 -0400

My New York City jury summons instructs me to report to 111 Centre St, Room 1121 on Wednesday May 23rd at 9 a.m. So at about 9:10 a.m. Wednesday morning I arrive at the jury waiting room on the 11th floor of the court building. The waiting room is long, with rows of cushioned chairs and a few small TVs mounted towards the front of the room. The TVs play a video explaining the important role a juror plays in the justice system. Every other seat in the room is filled with one of those important jurors, about 100 in total. Half watch the video, half just fidget with a phone and wait for something to happen. I find myself a seat and start to fidget with my phone.

The video ends and we wait some more. After a little while a woman arrives at the front of the room and starts talking. People shout out "here!" as she reads names of people who will be part of the first jury panel. The rest of us wait some more. The waiting area has free wifi and vending machines and some desks with power outlets but not much more than that. People look bored.

After more waiting, a second group of jurors is selected. This time I am called and we are led down to the 10th floor where there are benches along a hallway outside the courtroom. We sit and wait some more. After a bit, we enter the courtroom single-file. Waiting for us are a prosecutor, defense attorney, defendant, judge, clerk and a stenographer.

After some instructions from the judge, twenty of us are picked to sit in and around the jury box for questioning. As we are directed to our seats, the judge writes our names on a board in front of her so she can address everyone by name and take notes about each of us. The judge then begins asking us a long, long list of questions. She asks about our prior experiences with crime, with police, with the neighborhood the incident took place in, our neighborhoods and professions, our roommates' professions, our education, our hobbies, and what newspapers or magazines we read. We're asked about our interactions with the police and all confess to the date of our last speeding ticket. Some have relatives who have been arrested. Some have relatives who are police. I have neither.

Then both lawyers, first the prosecutor and then the defense attorney, ask us even more questions. They ask about our ability to follow the law, about how we will interpret whether someone is telling the truth or not, and about whether we would find someone guilty or not guilty in different circumstances.

Eventually we are released for lunch and told to report back at 2:30 p.m. We scatter out of the room, down the elevator and into Chinatown. At 2:30 p.m. I am back on the benches outside the 10th floor courtroom to wait some more. The court officer calls us back into the courtroom to find out who has been selected for the jury. Of the 20 of us questioned, only four are kept. Another group is seated in the jury box to be questioned. The 16 of us who were not picked go back to the jury room to wait some more. After a shorter wait we get an announcement that they won't be needing more jurors for the next few days and that our jury service is complete!

Free NYTimes.com Access for Schools and Libraries

Albert Sun — Wed, 25 Apr 2012 18:03:18 -0400

Here at the Times, we've just launched an internal system for sharing ideas and I posted this there. But I figured others might also be interested to hear my case for why NYTimes.com should offer free online access to schools and libraries. I, of course, have no real influence over this decision. I hope this is in the pipeline already or being considered, but I think we should whitelist the IP addresses of these public institutions.

When the paywall was launched there was a lot of hue and cry over how we were restricting the public value of our journalism by putting it behind a paywall. There are a lot of people for whom $15 a month is more than they can afford, and if we cut them off we become more of a tool purely of and for the elite. Public libraries are important institutions that provide access to information to large swaths of society that are underserved. They often have free copies of the printed New York Times. They should have free NYTimes.com too.

Kids in school are unlikely to have any influence on the decision to subscribe to the Times or not, and they should have the opportunity to read. Lifelong habits can develop early. I know that when I was a kid and had no control of money, what I read and what software I used was purely dictated by what I could get for free. By not letting kids read for free, we risk alienating an entire generation of new readers.

Neither of these groups of readers is likely to overlap much with the set of people likely to purchase a digital subscription. And it's unlikely that people who would otherwise purchase a subscription will start trekking to a school or library every time they want to read. Site licenses and group subscriptions might be a good solution for universities or workplaces, but for primary and secondary schools and public libraries they are likely to be beyond the budget or beyond the mind of whoever is in charge of purchasing.

Tool Making is What Sets Us Apart

Albert Sun — Wed, 11 Apr 2012 15:39:29 -0400

A lot of effort at journalism innovation has been focused around the product that our readers experience. People are doing great things to take advantage of the new storytelling forms and new ways of engaging with people that the web browser and the internet have made possible. But I want to turn some attention to the opposite side of things. What about all the myriad tasks that lead up to writing and producing a story that represent most of the work that a reporter does? Where is the innovation that makes that work faster and easier? What tools do people currently use? I would love to read a series of posts similar to News.me's "Getting the News" series but instead "Reporting the News" talking with a variety of different reporters going in-depth about their personal processes for reporting and writing stories. Anecdotally, it seems that most reporters use some mix of the standard email, address book, web search, note taking and writing tools that are available to everyone. But journalism is a specialized process and these are generalist tools. Surely there is room for improvement.

On the Apartment Hunt

Searching for an apartment in New York can be a long and painful process of navigating mercenary real estate brokers and misleading listings on multiple different sites. My two roommates and I have been through this process twice. Two years ago, we kept track of our search with a Google spreadsheet of possible apartments we could find and the status of our contact with each listing. It required a lot of manual work to remove duplicates and update information.

This year, we used a new tool called Nestio. Nestio has no apartment listings on it; it's not a competitor for Streeteasy or Craigslist. Instead, it's a tool for people searching for apartments to organize their search. You can add links or use a bookmarklet to save listings to it. Then it goes out and crawls that listing page, saves the photos and structures the information about the listing. You can keep track of when you are scheduled to visit each one and who the contact is for the listing. Through their mobile app you can add additional photos and notes when you visit, or update and correct the information that was scraped. And there's a mailer that lets you send a form email to the listing broker with one click and get responses back to your email.

Nestio made the search process a whole lot easier because there was a single way to refer to all the information around each apartment we were considering. It's a great tool for organizing information around a single purpose: finding a great apartment. Now where's the equivalent for reporting?

Three Reads on Advertising and Sponsored Posts

Albert Sun — Thu, 05 Apr 2012 21:51:19 -0400

A quick sequence of interesting news to read about advertising. First, Twitter advertising for small businesses. Twitter is allowing advertisers to take their existing Twitter accounts and tweets and have them shown as "promoted" content in the timelines of people who don't follow them. Tweets are algorithmically selected based on which ones people are engaging with and, also automatically, inserted into the feeds of people who will hopefully find them relevant. Second, an interview with Chris Batty, former head of ad sales at Gawker Media, who is headed to The Atlantic as the publisher for their planned new business site. Here he is talking about sponsored posts:

Mr. Batty: I know personally I want to know how big these shale-deposit discoveries are. If you listen to one side of the debate, it solves our energy problems. If you listen to the other, it’s too polluting. Let’s get to the bottom of it. Those are the kind of things that I think digital-publishing platforms can do really, really well relative to other media. We’re going to bring the power of the web to advertisers, not just hoard it for the purpose of aggregating enormous audience and not having a powerful enough ad system to generate the profits we need to reinvest. Ad Age: Is fracking really the right subject to investigate with paid posts written by people with huge stakes in the outcome? Isn’t that much better handled by a reporter without as much of a vested interest? Mr. Batty: Sure and we will do that for the benefit of the audience. But look, Shell knows a lot about the nature of these deposits. Let’s give them the power of our publishing tools to talk to our audience about it with the disclosure that this is Shell.

And finally, a long profile on BuzzFeed, via @zseward.

BuzzFeed currently earns all of its revenue from branded content—a form of advertising in which corporations create story-like units that live among a publisher’s editorial products and share the same underlying aesthetic, tone, and technology. Recent clients have included Kraft Foods, Dell, and McDonald’s.

Taken together, the three pieces linked above point to a possible way forward for advertising-supported media.

Bypassing the Media

The noise around aggregation and how the internet devalues original reporting misses the point altogether and is irrelevant to anything except authors' egos. The real threat to traditional journalism outfits is marketers going direct and bypassing the media altogether.

Historically, the high cost barrier of distribution and production of content prevented marketers from taking their message directly to the audiences they wanted to reach. The media choices people could make on any given day were finite and countable. In front of a newsstand, people would pick some number of publications to purchase and read. There was enough time to watch or listen to a fixed number of programs per day. Given that limited set, advertisers were left to buy space for their messages alongside the news articles people wanted to read and in between the TV and radio programs people wanted to watch. Outside of a few exceptions, consumers wouldn't consciously choose to see advertising.

That barrier has now collapsed. In our online lives we make hundreds if not thousands of choices about what media to experience every single day. No one outlet has the burden of providing "completeness." If marketers can create original content that both promotes their brand and is interesting and entertaining, then that content can spread to people through all the same channels that any other news or entertainment content does. That would be better than obnoxious pushdown banner ads, homepage takeovers and interstitials.

What companies have to say is often a part of the news and the public discourse. There are wires over which companies will send press releases and which journalists monitor for story ideas. Spokespeople for companies are often quoted in stories. Why waste a reporter's time rewriting a press release or copying down a company spokesperson's statement? Why not just let them publish those statements directly?
Many companies already use their own company blog to communicate very effectively, but most don't have the ability to reach everyone they want to reach whenever they want. Below, the headers from two pieces of content that don't originate from the publication hosting them.

BuzzFeed sponsored content

New York Times Op-Ed Content

Process Post: MetroCard Swipes Project

Albert Sun — Fri, 09 Dec 2011 06:00:00 -0500

One of my last projects for the WSJ was a story and interactive map of New York City showing the usage of different types of MetroCards at different subway stations. I always mean to write process posts, describing how I did things. I think "show your work" is a great idea and have really enjoyed reading other people's posts showing their work. I never got to write one about my foursquare check-ins project and by now I've probably forgotten too many of the details of the process to do a proper write up. Not this time. This'll be kind of a mind dump though.

False start with turnstile data

The idea for this project came from finding the dataset available on the MTA's developer website. In addition to the fare type data we ultimately ended up using, they also make available raw turnstile swipes, and that was the one I first looked at. This was last July. The turnstile data contains cumulative counts of entrances and exits and status report codes for every turnstile in the system every four hours. Then there's another file matching those codes up to station names and subway lines. In a first pass at this data, I made simple line charts of each turnstile's entrances and exits over time. They were pretty messy and only mildly interesting. I can't find the charts anymore or I'd show them here. Then the project lay dormant for about a year as other news intervened.
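Because the turnstile counts are cumulative, charting them means first turning each pair of consecutive readings into a per-interval count. Here is a minimal sketch of that step in Python. The readings are made up, and treating a negative difference as a counter reset is my assumption, not something the MTA documents in this post.

```python
def interval_counts(cumulative):
    """Convert cumulative register readings into per-interval counts."""
    counts = []
    for previous, current in zip(cumulative, cumulative[1:]):
        delta = current - previous
        # A negative delta suggests the counter was reset or rolled over
        # (an assumption); record None rather than guess at the true count.
        counts.append(delta if delta >= 0 else None)
    return counts


# Made-up cumulative entrance readings for one turnstile, one every four hours.
readings = [1000, 1250, 1900, 2100, 50, 400]
per_interval = interval_counts(readings)
print(per_interval)
```

Each value in the result covers the four hours between two readings, so a list of n readings yields n - 1 counts.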

Real start, cleaning data

Eventually, the Greater New York section came upon the fare type data posted by the MTA and was interested in running a story based upon it. The fare type data set is a single file for each week and records, station by station, how many times each type of MetroCard was swiped. Starting out, we didn't know what the data would show and we didn't have an entirely clear idea of what we were looking for, so I decided to just clean it up and play around with the data for a bit to see if any interesting trends popped out. This is not necessarily how I would approach a data project in the future. I think we might've ended up with something even more interesting and more pointed if we had had some questions we wanted to answer with the data to start with. The first step was to import all the data to one place instead of one separate file per week. I used a quick Python script to put everything into MySQL, then exported to Google Refine to fix inconsistent spellings of station names, and then exported to CSV. Then I thought I'd try a non-geographic visualization. I'd make a grid with the weeks on one axis and the subway stations along another, ordered from high traffic to low traffic. At each point in the grid there'd be a pie chart or a stacked bar showing the proportion of each type of swipe at that station in that week.
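To make the merge-then-proportion idea concrete, here is a rough sketch in Python. It is not the script I actually ran (that one wrote to MySQL, and the name cleanup happened in Google Refine). The two weekly "files" are made-up strings, the station names are invented, and upper-casing the names is only a stand-in for real cleanup. The column names follow the ones that appear in the R output later in this post.

```python
import csv
import io

# Two made-up weekly files standing in for the one-file-per-week layout.
WEEKLY_FILES = {
    "2010-08-21": "STATION,FF,X7.D.UNL,X30.D.UNL\nExample Ave,100,50,50\nSample St,30,10,60\n",
    "2010-08-28": "STATION,FF,X7.D.UNL,X30.D.UNL\nEXAMPLE AVE,80,60,60\nSAMPLE ST,25,25,50\n",
}
FARE_COLUMNS = ["FF", "X7.D.UNL", "X30.D.UNL"]


def load_weeks(files):
    """Merge the weekly files into one list of rows tagged with their week."""
    rows = []
    for start_date, text in sorted(files.items()):
        for record in csv.DictReader(io.StringIO(text)):
            # Upper-casing is a crude stand-in for fixing inconsistent spellings.
            row = {"start_date": start_date, "STATION": record["STATION"].strip().upper()}
            for column in FARE_COLUMNS:
                row[column] = int(record[column])
            rows.append(row)
    return rows


def fare_shares(row):
    """Return each fare type's share of the swipes in one station-week."""
    total = sum(row[column] for column in FARE_COLUMNS)
    if total == 0:
        return {column: 0.0 for column in FARE_COLUMNS}
    return {column: row[column] / total for column in FARE_COLUMNS}


rows = load_weeks(WEEKLY_FILES)
for row in rows:
    print(row["start_date"], row["STATION"], fare_shares(row))
```

Each printed dictionary is what one pie chart or stacked bar in the grid would show.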

Using Processing

I decided to first try and do it in Processing as a way to learn more about it and 3D graphics. It came out looking kind of like this.

A sea of little cylinders

So yea. I didn't quite have the scale of the data right. 460 stations by 60ish weeks of data each? Oh, that's almost 28,000 datapoints. It was not particularly comprehensible. Maybe it'd be better to try and place stations according to their geographic location instead of in a grid, and then animate over time. Unfortunately, the station names in the fare type files didn't match the station names anywhere else. They were a combination of the names shown on the official subway map and, when those conflicted, an added cross street. The official file of station locations gave station locations by station name and line, along with an exact lat lng for each entrance or exit. I matched these up by hand, picking one entrance for each station. The resulting file is here. After a few iterations, I ended up with something looking like this (using the plate carrée projection, a.k.a. x=lng, y=lat):

Each station is a stack of cylinders, largest on the bottom, with volume proportional to the number of swipes
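For anyone curious about the geometry, here is a small Python sketch of the two calculations involved (the original was done in Processing). It assumes every cylinder has the same fixed height, so the radius carries the volume. The bounding box, canvas size and `volume_per_swipe` scale are made-up illustration values.

```python
import math


def plate_carree(lat, lng, bounds, width, height):
    """Map lat/lng to canvas x/y by scaling them linearly (x=lng, y=lat)."""
    min_lat, max_lat, min_lng, max_lng = bounds
    x = (lng - min_lng) / (max_lng - min_lng) * width
    # Screen y grows downward, so flip latitude to keep north at the top.
    y = (max_lat - lat) / (max_lat - min_lat) * height
    return x, y


def cylinder_radius(swipes, height, volume_per_swipe=1.0):
    """Radius of a fixed-height cylinder whose volume is proportional to swipes."""
    volume = swipes * volume_per_swipe
    # volume = pi * r^2 * height, solved for r.
    return math.sqrt(volume / (math.pi * height))


# Hypothetical bounding box (min_lat, max_lat, min_lng, max_lng) and canvas size.
bounds = (40.5, 41.0, -74.3, -73.7)
print(plate_carree(40.75, -74.0, bounds, 600, 500))
print(cylinder_radius(20000, height=10))
```

Scaling volume rather than radius matters: because volume grows with the square of the radius, a station with four times the swipes gets a cylinder only twice as wide.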

Kind of cool. Still not exactly easy to make sense of, even though in Processing I can adjust the camera and fly around it. Time to try another tack.

Using R

Since trying to visualize the data straight away wasn't working so well, I decided to try and analyze the data in R and find some basic summary statistics. I use R in RStudio (rstudio.org), which is a really nice IDE for R. I'm almost a complete beginner at R, and it's been really helpful. There's this really cool function, summary(dataframe), that takes some data and prints out a whole bunch of summary statistics of it. So I did:

MTAFARES1108 <- read.csv("~/MTAFARES1108/data/MTAFARES1108_cleaned.csv")
summary(MTAFARES1108)

and got out

      start_date          end_date         REMOTE                             STATION     
 2010-08-21:  466   2010-08-27:  466   R001   :   61   42ND STREET & GRAND CENTRAL:  183  
 2010-11-06:  466   2010-11-12:  466   R002   :   61   23RD STREET-6TH AVENUE     :  122  
 2010-11-20:  466   2010-11-26:  466   R003   :   61   25TH STREET-4TH AVENUE     :  122  
 2010-11-27:  466   2010-12-03:  466   R004   :   61   34TH STREET & 6TH AVENUE   :  122  
 2010-12-04:  466   2010-12-10:  466   R005   :   61   34TH STREET & 8TH AVENUE   :  122  
 2010-12-11:  466   2010-12-17:  466   R006   :   61   42ND STREET & 8TH AVENUE   :  122  
 (Other)   :25535   (Other)   :25535   (Other):27965   (Other)                    :27538  
       FF            SEN.DIS      X7.D.AFAS.UNL  X30.D.AFAS.RMF.UNL  JOINT.RR.TKT       X7.D.UNL    
 Min.   :     0   Min.   :    0   Min.   :   0   Min.   :   0.0     Min.   :   0.0   Min.   :    0  
 1st Qu.:  9517   1st Qu.:  363   1st Qu.:  37   1st Qu.: 128.0     1st Qu.:   1.0   1st Qu.: 3080  
 Median : 16029   Median :  664   Median :  78   Median : 262.0     Median :   5.0   Median : 6226  
 Mean   : 27757   Mean   : 1217   Mean   : 113   Mean   : 412.6     Mean   : 104.4   Mean   : 8999  
 3rd Qu.: 32164   3rd Qu.: 1427   3rd Qu.: 147   3rd Qu.: 490.0     3rd Qu.:  26.0   3rd Qu.:11638  
 Max.   :291172   Max.   :13083   Max.   :1082   Max.   :5062.0     Max.   :7951.0   Max.   :97486  
                                                                                                    
   X30.D.UNL      X14.D.RFM.UNL       X1.D.UNL         X14.D.UNL       X7D.XBUS.PASS    
 Min.   :     0   Min.   :  0.00   Min.   :    0.0   Min.   :    0.0   Min.   :   0.00  
 1st Qu.:  5134   1st Qu.:  0.00   1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:   7.00  
 Median : 11295   Median :  2.00   Median :    8.0   Median :  108.0   Median :  20.00  
 Mean   : 19833   Mean   : 12.65   Mean   :  325.9   Mean   :  665.8   Mean   :  86.85  
 3rd Qu.: 26114   3rd Qu.: 16.00   3rd Qu.:  226.0   3rd Qu.:  915.5   3rd Qu.:  73.00  
 Max.   :276941   Max.   :251.00   Max.   :18867.0   Max.   :21757.0   Max.   :2371.00  
                                                                                        
      TCMC         LIB.SPEC.SEN      RR.UNL.NO.TRADE   TCMC.ANNUAL.MC   MR.EZPAY.EXP   
 Min.   :   0.0   Min.   :0.000000   Min.   :    0.0   Min.   :    0   Min.   :   0.0  
 1st Qu.:  44.0   1st Qu.:0.000000   1st Qu.:    3.0   1st Qu.:  386   1st Qu.:  17.0  
 Median : 104.0   Median :0.000000   Median :   12.0   Median :  854   Median :  49.0  
 Mean   : 271.2   Mean   :0.006424   Mean   :  280.7   Mean   : 1416   Mean   : 175.3  
 3rd Qu.: 283.0   3rd Qu.:0.000000   3rd Qu.:   69.0   3rd Qu.: 1701   3rd Qu.: 168.0  
 Max.   :3600.0   Max.   :3.000000   Max.   :16197.0   Max.   :21629   Max.   :2890.0  
                                                                                       
  MR.EZPAY.UNL        PATH.2.T         AIRTRAIN.FF      AIRTRAIN.30.D      AIRTRAIN.10.T    
 Min.   :   0.00   Min.   :    0.00   Min.   :    0.0   Min.   :    0.00   Min.   :   0.00  
 1st Qu.:  12.00   1st Qu.:    0.00   1st Qu.:   22.0   1st Qu.:    0.00   1st Qu.:   0.00  
 Median :  35.00   Median :    0.00   Median :   56.0   Median :    0.00   Median :   0.00  
 Mean   :  92.81   Mean   :   34.48   Mean   :  274.9   Mean   :   46.38   Mean   :  13.59  
 3rd Qu.: 113.00   3rd Qu.:    0.00   3rd Qu.:  161.0   3rd Qu.:    0.00   3rd Qu.:   0.00  
 Max.   :1707.00   Max.   :10265.00   Max.   :47909.0   Max.   :17933.00   Max.   :6150.00  
                                                                                            
 AIRTRAIN.MTHLY        total       
 Min.   :  0.000   Min.   :     1  
 1st Qu.:  0.000   1st Qu.: 21130  
 Median :  0.000   Median : 37422  
 Mean   :  1.111   Mean   : 62134  
 3rd Qu.:  0.000   3rd Qu.: 76816  
 Max.   :687.000   Max.   :697709

Similarly, the plot function has a cool default when called on a dataframe that prints a whole bunch of summary plots.

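# Sum FF, SEN.DIS, X7.D.UNL, X30.D.UNL, X1.D.UNL, X14.D.UNL and total for each week
BYDATE <- aggregate(MTAFARES1108[,c(5,6,10,11,13,14,27)], list(start_date=MTAFARES1108$start_date), sum)
# Subtotal of just the six fare types (columns 2 to 7; column 8 is the overall total)
BYDATE$subtotal <- rowSums(BYDATE[,c(2:7)])
plot(BYDATE)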


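# Sum FF, SEN.DIS, X7.D.UNL, X30.D.UNL and total for each station
BYSTATION <- aggregate(MTAFARES1108[,c(5,6,10,11,27)], list(STATION=MTAFARES1108$STATION), sum)
# Subtotal of just the four fare types (columns 2 to 5; column 6 is the overall total)
BYSTATION$subtotal <- rowSums(BYSTATION[,c(2:5)])
plot(BYSTATION)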


Printed out big, these are kind of fun to look at. Each variable in the data is in a scatter plot with each other variable. You can see some trends in these plots. For example, the usage of full fare and seven-day unlimited cards trends up when the one-day and 14-day unlimited cards are discontinued. The usage of each type of card is generally pretty well correlated with the others. At PATH stations, only full fare cards are used, so there's a set of stations without unlimited card swipes.

More data, mooore data

Select blocks that intersect a 1km radius circle.

Now I already had a database full of census data on population from making Census Map Maker, so I decided to bring that in too. I wrote a Python script (a GeoDjango script, to be exact) to loop through all the 2010 census blocks for Manhattan, Brooklyn, Queens and the Bronx, assign each block to the closest subway station, and then calculate the union polygon of the set of blocks for each subway station. Then each shape was assigned the data for that subway station. Later I limited each area to also be within 1000 meters of a subway stop to get the final shapes. By doing this, we get the race and income data for each area around a subway station. That could be interesting to look at. Exporting back into a CSV file with one row per station and then using R again gives us a couple of charts like these below. Each point represents one subway station area. The y-scale, commuters_percent, is people using 30-day unlimited or TransitChek unlimited MetroCards.
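The nearest-station assignment can be sketched without GeoDjango. The Python below is a simplified stand-in, not the script I ran: it treats each census block as a single point (say, its centroid) instead of a polygon, uses the haversine formula for distance, and skips the polygon union entirely. The station and block coordinates are made up.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def assign_blocks(blocks, stations, max_distance_m=1000):
    """Map each block id to its closest station, or None if none is in range."""
    assignment = {}
    for block_id, (blat, blng) in blocks.items():
        name, distance = min(
            ((s, haversine_m(blat, blng, slat, slng)) for s, (slat, slng) in stations.items()),
            key=lambda pair: pair[1],
        )
        assignment[block_id] = name if distance <= max_distance_m else None
    return assignment


# Made-up stations and block centroids, for illustration only.
stations = {"A": (40.75, -73.99), "B": (40.70, -73.99)}
blocks = {
    "block-1": (40.751, -73.99),  # roughly 110 m from station A
    "block-2": (40.705, -73.99),  # roughly 560 m from station B
    "block-3": (40.90, -73.99),   # many kilometers from both
}
print(assign_blocks(blocks, stations))
```

Summing the census counts of the blocks assigned to each station then gives one demographic row per station, which is the shape of the CSV that went back into R.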

Call:
lm(formula = commuters_percent ~ whites_percent)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.236574 -0.048477 -0.004978  0.048037  0.198023 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.28391    0.00543   52.28   <2e-16 ***
whites_percent  0.15259    0.01207   12.64   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.0689 on 390 degrees of freedom
Multiple R-squared: 0.2905,	Adjusted R-squared: 0.2887 
F-statistic: 159.7 on 1 and 390 DF,  p-value: < 2.2e-16

Call:
lm(formula = commuters_percent ~ blacks_percent)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.191503 -0.052115 -0.001902  0.054073  0.155704 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.377349   0.004894   77.10   <2e-16 ***
blacks_percent -0.162457   0.013517  -12.02   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.06987 on 390 degrees of freedom
Multiple R-squared: 0.2703,	Adjusted R-squared: 0.2684 
F-statistic: 144.4 on 1 and 390 DF,  p-value: < 2.2e-16

Call:
lm(formula = commuters_percent ~ median_income)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.179520 -0.058256 -0.002264  0.053546  0.206933 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.823e-01  8.225e-03  34.326  < 2e-16 ***
median_income 1.055e-06  1.411e-07   7.476 5.11e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.0765 on 390 degrees of freedom
Multiple R-squared: 0.1253,	Adjusted R-squared: 0.1231 
F-statistic: 55.89 on 1 and 390 DF,  p-value: 5.109e-13

So these are pretty noisy and I didn't get all that far with an analysis. We ended up not doing anything with these regressions. I'm posting the datafile here in case anyone else wants to play around with it further.
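For readers who haven't met lm() before, here is a minimal Python sketch of what a one-variable least-squares fit computes: an intercept, a slope and an R-squared. It is an illustration, not a replacement for R's output (no standard errors or p-values), and the sample points are made up. The only real numbers are the two coefficients copied from the whites_percent regression above.

```python
def simple_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variance in y explained by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot


def predict(intercept, slope, x):
    """Fitted value for a single x."""
    return intercept + slope * x


# Made-up points, only to show the function running.
xs = [0.1, 0.3, 0.5, 0.7, 0.9]
ys = [0.30, 0.32, 0.37, 0.38, 0.43]
print(simple_regression(xs, ys))

# Coefficients reported above for commuters_percent ~ whites_percent.
print(predict(0.28391, 0.15259, 0.5))
```

Read that way, the first regression says a station area that is 50 percent white would be expected to have about 36 percent of its swipes from commuter cards, with an R-squared of 0.29 leaving most of the variation unexplained.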

Fare price increase

Next was to try and see if the fare increase made a difference in people's metrocard usage habits. Looking at systemwide usage, there were big dips in the number of swipes in weeks without five working days as so much usage is by people commuting. To smooth things out for comparison, I cut out holiday weeks and then picked as many from before and after as there was still data for. This turned out to be 27 weeks each.
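The week selection can be sketched like this in Python. This is a reconstruction of the idea rather than my original code: the fare change date and holiday list are hypothetical placeholders, and taking the weeks closest to the change on each side is my assumption about how to balance the two windows.

```python
from datetime import date


def balanced_windows(weeks, change_date, holiday_weeks):
    """Drop holiday weeks, then keep equal numbers of weeks before and after."""
    usable = sorted(w for w in weeks if w not in holiday_weeks)
    before = [w for w in usable if w < change_date]
    after = [w for w in usable if w >= change_date]
    n = min(len(before), len(after))
    # Keep the n weeks nearest the change on each side (an assumption).
    return before[len(before) - n:], after[:n]


# Week start dates; the change date and the holiday weeks are hypothetical.
weeks = [
    date(2010, 12, 4), date(2010, 12, 11), date(2010, 12, 18), date(2010, 12, 25),
    date(2011, 1, 1), date(2011, 1, 8), date(2011, 1, 15),
]
holiday_weeks = {date(2010, 12, 25), date(2011, 1, 1)}
before, after = balanced_windows(weeks, date(2010, 12, 30), holiday_weeks)
print(before)
print(after)
```

With the real data, this kind of balancing is what produced 27 weeks on each side.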

With all weeks included it's really spiky.

With holiday weeks removed, the data looks a lot smoother.

Testing to see if there was a significant change was done like this in R, with one t-test at the 99% confidence level for each station and type of swipe.

t.test(after$swipes,before$swipes,conf.level=0.99,var.equal=TRUE)
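With var.equal=TRUE, R runs the classic pooled-variance two-sample t-test. As a rough illustration of what sits underneath, here is the test statistic and its degrees of freedom in plain Python. The swipe counts are made up, and the sketch stops short of a p-value, which needs a t-distribution table or a statistics library.

```python
import math
from statistics import mean, variance


def pooled_t(after, before):
    """Student's two-sample t statistic with pooled variance, plus its df."""
    n1, n2 = len(after), len(before)
    df = n1 + n2 - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(after) + (n2 - 1) * variance(before)) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(after) - mean(before)) / std_err, df


# Made-up weekly swipe counts for one station and one fare type.
after_swipes = [12, 14, 16]
before_swipes = [10, 11, 12]
t_stat, df = pooled_t(after_swipes, before_swipes)
print(t_stat, df)
```

With 27 weeks on each side, df comes out to 52 for every station and fare type. Running one test per station and fare type also means a lot of tests, so a handful of "significant" results would be expected by chance alone.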

Retrospective

Ultimately, once all the data work was complete, building the map interactive itself was fairly straightforward. The part I think worked best was the annotations with descriptions of interesting points to look at. Despite having that, though, I still think this project ended up kind of falling victim to the throwing-data-at-readers problem. It could've benefited from an even stronger narrative strand to tell a specific story. As noted at the beginning, we didn't have a good sense of what story we wanted to tell or what questions we wanted to answer, so in the data analysis I kind of struggled to figure out what was interesting and what wasn't. Someone suggested on Twitter having a feature where users could add annotations or questions to each subway stop. In retrospect, I think this would've been really useful and I wish I had built that in. So many of the stories are probably local and specific to a subway station. And there are probably interesting data anomalies at that level that people could ask about, and then we could put in the reporting effort to find answers.

Joining the New York Times

Albert Sun — Sun, 04 Dec 2011 14:49:49 -0500

There's some news I'm very excited to announce — I'll soon be starting a new job as part of the Interactive News Team at the New York Times. For quite a while, the NYT has been setting the standard for the kind of work I want to do. The 2009 New York Magazine profile of the team was in large part what convinced me that a career in journalism might actually be viable for me. So many members of the team I'm joining are people whose work I greatly admire. I've already learned a ton from reading their blog posts, open source code, listening to their conference and meetup presentations and trying to reverse engineer their work. So needless to say, I'm really excited to work with and learn even more from them. In my time at the Wall Street Journal, I've also had the pleasure of working with many talented people. I'm especially grateful for the freedom I had to experiment and try new things. Still, I'm sure this is the right move for me. It's clear that at the highest levels of the organization the New York Times has made the web a priority in a way that the Journal has not. I'm extremely grateful to Aron Pilhofer and everyone else at the Times who I've talked to before now for giving me this opportunity. I can't wait to get started. My first day is December 12th.

Running on Django

Albert Sun — Sun, 27 Nov 2011 20:39:56 -0500

This site is now running on Django with a pretty stripped down amount of code. In many ways, it's going backwards from WordPress, but I see this as a way to keep my Django skills sharp and to try out a few ideas of mine on a site I'm fully invested in.

Lessons from teaching programming

Albert Sun — Fri, 29 Jul 2011 18:13:05 -0400

On Wednesday night I taught a workshop at CUNY Graduate School of Journalism called Intro to Data Journalism with Python. In this class, I tried to teach enough programming to analyze a university registrar's website and find the most popular time slots for classes. The course outline is here on GitHub. I think the class went pretty well, though some people looked bored at the beginning and some left before the end. I think one issue was that in a one-time workshop, people come in with a wide range of prior experience. I haven't done much teaching before, so I learned some things about how I could improve it for next time.

Better communicate the expected knowledge level coming in. The class description should have more clearly stressed that the class was for complete beginners and described what the starting material would be in more depth.
Having an assistant (or a smaller class) would have helped to get people set up and using their computers. At the beginning of the class people needed to open up the command line and navigate to the proper directory. Having another instructor would have made this quicker. And there would be someone to float around the room during instruction to help anyone who got lost get back on track.
In the class, I jumped into the coding part too quickly. In retrospect, people would've learned more and been more engaged if I had gone over more examples of why what I was about to teach is useful. Having specific use cases in mind would have made it easier to understand the coding part.
I should have been more insistent on getting feedback from people about the pace of the class to make sure that people weren't falling behind.
Finally, I should have made people type more and listen less. If I had split up the me talking part with some simple exercises or incomplete programs and asked people to finish them I think people would've been more engaged and learned more.

Anyone else have other tips for teaching programming?

Social media or user engagement?

Albert Sun — Thu, 14 Jul 2011 20:20:24 -0400

Social media has quickly become a major source of traffic for news sites. (See the Pew Research study Navigating News Online published May 2011.) People spend a lot of time on social sites and find a lot of relevant news through them. It seems imperative for news websites to "go where the readers are" and engage with them through social media. All the major social networking players even have special media relations teams to help news brands use their networks to the fullest. Initial steps into social media usually seem to pay off with a big increase in site traffic. And newsrooms have been faulted before for failing to innovate enough and embrace the web. So it seems they should jump into Twitter and Facebook with both feet and join the modern web before it's too late.

Not such a good idea?

By investing effort into Facebook and Twitter, news sites give the social networks more mainstream legitimacy and consequently more new users. And as news becomes easily available on social networks, users become more locked in to those platforms. By all means reporters and editors should be using social networks to find sources and do the business of reporting and spreading a story. But for any specific social networking site to become a major part of a news website's strategy is giving up too much control of the reader relationship and could be a dangerous mistake. If users always interact with a news site through Facebook or Twitter, then that news site is at the mercy of the platform and a small algorithm tweak could easily send all that traffic to a competitor. The interests of profit-seeking tech companies are at best orthogonal to those of any media company. Depending on their platforms to engage with readers would turn a news organization into a sharecropper, putting in journalistic effort but letting others reap the majority of the rewards in exchange for a pittance of pageviews. To thrive, news sites need to own their reader relationships with social networking sites playing a secondary role. The user experience should be such that readers are not substantively harmed if the social networks were to disappear (or change the rules) the next day. The key concept is lock-in. Is the news site building user engagement in a way that increases a user's lock-in to the news site more than their lock-in to Facebook and Twitter? If not, then it's probably a mistake. For an elections news app, it may be smart to use Facebook to provide recommendations to a user based on their friends. But the app's core manner of engaging with the user should be something independent, like the ability to pick candidates or races of interest to follow. Print publications have long known the value of a loyal, locked-in audience of subscribers.
A successful online strategy will be one that focuses on user engagement and making the news site irreplaceable for users. Social media is then just another customer acquisition channel to bring new readers in. This is a topic I've spent a lot of time thinking about and talking with people about (including one long discussion on a rainy hike in the south Jersey Pine Barrens) but this is the first time I've tried to set the ideas down in a fixed form. I did write about social engagement for news sites from a paid content perspective a year and a half ago.

Teaching a class

Albert Sun — Mon, 11 Jul 2011 18:26:32 -0400

So I'm teaching a class on "data journalism" and Python at CUNY Journalism School. This will be the third time I've done this particular session: the first was at the CMA College Media Convention in March and the second at BCNI Philly in April. The session at CMA went pretty well, but it was much better at BCNI because all the attendees had computers and could follow along, so that's the way I'll be doing it this time. The code I taught with before is here, if you're curious.

Measuring "casual" website visitors

Albert Sun — Wed, 15 Jun 2011 14:43:35 -0400

One widely adopted kernel of wisdom about news online has become that the vast majority of traffic to a news site is made up of "casual" visitors or "fly-bys" that visit just once or twice a month. I think measurement error might be driving this statistic far higher than reality. I'm reading Matthew Hindman's report for the FCC on local news consumption (summarized and linked to from here) and it again repeats this observation. My roommate has a habit of clearing his browser's cookies and all private data every time he closes it. Yet he visits basically the same set of news sites every single day. If these sites are using cookies to track his visits, as is the standard way, they are counting him as 30 different visitors a month. Let's do some rough math to observe how much impact this could have on the results of a study of that data. Let's assume we have a site that has measured 130 unique visitors at an average of 10 pageviews per visitor for the month. In total they've got 1,300 pageviews. If 1% of their visitors browsed like my roommate does, they would actually have only about 100 unique visitors, and each person would have 13 pageviews for the month. What if 2% of people did it? Then the site has only about 82 real visitors and the average pageviews per person climbs to about 16. Maybe news visitors aren't so disengaged after all.
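Here is a minimal sketch of that rough math, assuming each cookie-clearing visitor is counted as 30 separate uniques and that the clearing share is a fraction of the true audience. Both are simplifications, and the function name is hypothetical. Under those assumptions the 1% case works out to about 13 pageviews per person and the 2% case to about 16.

```python
def true_visitors(measured_uniques, clearing_share, visits_per_month=30):
    """Estimate real visitors when a share of them is counted once per visit."""
    # Each cookie-clearing visitor shows up as `visits_per_month` separate uniques,
    # everyone else shows up exactly once.
    inflation = (1 - clearing_share) + clearing_share * visits_per_month
    return measured_uniques / inflation


measured_uniques = 130
total_pageviews = 1300

for share in (0.01, 0.02):
    visitors = true_visitors(measured_uniques, share)
    print(f"{share:.0%} clearing cookies: {visitors:.0f} real visitors, "
          f"{total_pageviews / visitors:.1f} pageviews each")
```

The pageviews-per-person figure is just the measured average multiplied by the inflation factor, so even a small share of cookie clearers moves it noticeably.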

The most interesting parts of "War at the Wall Street Journal" (to me)

Albert Sun — Mon, 21 Feb 2011 00:16:49 -0500

Beyond the story lines about Murdoch and the Bancroft family, and Marcus Brauchli and Robert Thomson, Sarah Ellison's "War at the Wall Street Journal" has an interesting story line about what had made the Journal unique before the takeover and about a newspaper trying to adapt to the Internet.

About being a "second read" paper

The notion that the Journal could be a second read, famously espoused by the legendary midcentury Journal editor Barney Kilgore, was no more. No one had time to read two publications. And anyway, Murdoch didn't want to be second at anything. As smaller papers around the country faltered, Murdoch wanted to pick off their readers.

-- War at the Wall Street Journal, by Sarah Ellison. page 199

About "Journal 3.0"

[Publisher Gordon] Crovitz decided he would call the new iteration of the newspaper "Journal 3.0." He arrived at the name, never popular in the Journal's newsroom or executive floor, by taking particular note of the Journal's lead front-page story the day after Japan attacked Pearl Harbor: "War with Japan Means Industrial Revolution in the United States" read the headline. The story outlined the implications of the attack on the country's economy, industry, and financial markets. For Crovitz, it also marked the end of the first phase of the Journal, "Journal 1.0," the time between the paper's founding in 1889 and December 5, 1941. During that period, the Journal reported the news like any other outlet. After that headline and under Bernard Kilgore, who became the paper's managing editor the year of the Pearl Harbor attack, the Journal started adding more analysis to its stories and expanded its coverage beyond business and finance. Crovitz defined "Journal 2.0" as starting on December 8, 1941. He planned for it to end on December 31, 2006, when he would usher in the paper's third phase. To compete against the immediacy of the Web, Crovitz wanted the paper, instead of running stories that rehashed what people had learned the day before on their BlackBerrys, to become more analytical. Journal reporters would break news on the Web site and then examine it in the next day's paper.

-- War at the Wall Street Journal, by Sarah Ellison. page 51

About the morning news meeting

Following the Journal's tradition, the editors wouldn't talk about the biggest news of the day. Unlike every other newspaper in every jurisdiction of every country in the world, the Wall Street Journal didn't put news on its front page. The paper relegated the biggest news stories to the inside of the paper, on page A3. Epic features and investigations for Page One were mapped out weeks if not months in advance. Because of this Journal peculiarity, the morning news meeting was not a frenetic debate about the most disastrous or dramatic news events, but rather a mannered recitation of the day's "sked" of stories. In a business of attention-grabbing headlines and color photos, the paper treated its front page like a quiet haven for reflective storytelling. Breaking news was important, and the paper did plenty of it, but the craft of feature writing was the center of the paper's identity.

-- War at the Wall Street Journal, by Sarah Ellison. page 48

About "the pack"

[Murdoch] wanted the Journal to lead the media pack. It was antithetical to the Journal ethos. "Even if you're leading the pack, you're still part of the pack," Peter Kann, the Journal's former CEO, liked to say. "If there's something everyone is talking about, that should be on the front page of the Wall Street Journal," Murdoch told his aides.

-- War at the Wall Street Journal, by Sarah Ellison. page 170

From Print to Portal: More Online News Pricing Research

Albert Sun — Thu, 20 May 2010 14:43:28 -0400

Some classmates of mine at Penn recently finished a class on Pricing Strategies in the Marketing Department taught by Professor Z. John Zhang who studies such things and they've written a paper named "From Print to Portal: Pricing Strategies in the Online News Realm." They've kindly given me permission to post it online and share it so go ahead and check it out here. (PDF Link) They give a history of the topic and discuss what many companies are doing now. In the conclusion they suggest that news sites should adopt hybrid subscription models. The paper is a good qualitative treatment of the subject and a fresh take from some people not personally invested in the subject. This was a final paper for the class, and from what I know, none of the five team members have ties to or have worked in the industry.

Goodbye Dear Penn!

Albert Sun — Wed, 19 May 2010 22:23:35 -0400

I am officially a graduate of the University of Pennsylvania. This infographic I made for the DP does a fair job of summing it up.

My Senior Thesis

Albert Sun — Tue, 04 May 2010 21:34:08 -0400

A Mixed Bundling Pricing Model for News Websites

Abstract: This paper outlines a method for finding revenue maximizing mixed bundling prices for news websites. This can help better understand paid content strategies for online news content. Drawing on work in the field of bundling information goods, I apply a two-parameter model of consumer preferences to web site traffic data and a roughly estimated willingness-to-pay curve. We can then calculate revenues for different price points and find the optimal one for any given site. This method is applied to a sample of ten sites. At revenue maximizing prices, the majority of paid revenue for these sites comes from the sale of individual articles, rather than subscriptions. Site traffic showing highly loyal consumers is found to correlate with higher subscription prices. This model suggests that while it is possible for overall revenue to be higher with a paid content plan, total traffic will certainly fall. It can be found online here in PDF form. I'm mostly happy with the way it turned out, though there were a lot of compromises and broad assumptions needed to bring it to a finished product. There's so much interesting material in this field, I wish I could spend a few more years studying it. I guess that's what graduate school would be, if I ever decide to attend. Special thanks go out to Aleks Jakulin for supporting and encouraging me in this work.

Short-form blog

Albert Sun — Wed, 31 Mar 2010 14:01:12 -0400

I don't post here nearly as much as I should because I've set a precedent of long posts that take a lot of effort and I don't want to muddy up the stream with little stuff. I know I've also promised posts that I haven't delivered on. They're coming (I hope). But meanwhile, I will blog in short-form, and a little more personally, at http://albertsun.posterous.com/ to keep things flowing.

Economic Analysis of the New York Times Paywall

Albert Sun — Tue, 26 Jan 2010 18:40:14 -0500

After the New York Times announced its metered paywall last week there has been a lot of empty blather. Standing out from all the noise are two very good analyses. The first was by Felix Salmon for Reuters, analyzing a consumer's decision of whether or not to pay. The second one was by Jonathan Stray on Nieman Lab, showing the effect of several different variables on revenue. This stuff is right up my alley, and I'm currently working on a senior thesis in the field, so I'll try to extend Salmon's analysis a little bit. Later on, I'll take on Stray's model as well.

Salmon's Analysis

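Let's say a reader in a given period reads $latex N$ articles from the New York Times. Then suppose the New York Times sets the paywall after a consumer has read some $latex n < N$ articles and charges a fee $latex F$ to keep reading. If every article is worth the same amount $latex v$ to the reader, the reader pays only when the remaining articles are worth more than the fee, $latex (N-n)v > F$. This is a good simple model synopsis.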

Article Values are Different

Let $latex n,N,F$ be as before. The first issue that jumps out is that the value of any given article is not constant. The value of articles over a period varies, so let's arrange them in order of value from highest to lowest. Let $latex \{v_i\}_{i=1..\infty}$ be a monotonically non-increasing sequence of article values for our reader, with $latex v_i = 0 \:\forall\: i>N$. Then the reader gets value, [latex]u(v)=\left\{\begin{matrix}\left (\sum_{i=1}^{N}{v_i} \right ) - F &\text{if}\;\sum_{i=n+1}^{N}{v_i} > F\\ \sum_{i=1}^{n}{v_i} &\text{if}\; \sum_{i=n+1}^{N}{v_i} \leq F \end{matrix}\right.[/latex] The reader would clearly choose to read the articles he values most first, and after that only pay the subscription if the rest of the articles he has yet to read are still valuable enough. Only if $latex \sum_{i=n+1}^{N}{v_i} > F$ will the reader pay the fee. But this is not quite right either. There's no way for a reader to know ahead of time which articles are most valuable to him.
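As a small illustration of this ordered-value version of the model, here is a Python sketch of the reader's payoff. The function name and the example values are made up; the logic follows the rule above, where the reader subscribes only if the articles behind the paywall are together worth more than the fee.

```python
def reader_utility(values, n, fee):
    """Payoff for a reader with perfect foresight about article values.

    values: the reader's article values v_1..v_N, n: free articles, fee: F.
    """
    ordered = sorted(values, reverse=True)  # most valuable articles are read first
    free_value = sum(ordered[:n])           # articles 1..n, read before the paywall
    paid_value = sum(ordered[n:])           # articles n+1..N, behind the paywall
    if paid_value > fee:
        return free_value + paid_value - fee  # subscribes and reads everything
    return free_value                         # stops at the paywall


values = [5, 4, 3, 2, 1]
print(reader_utility(values, n=2, fee=5))  # remaining articles are worth 6, so the reader pays
print(reader_utility(values, n=2, fee=6))  # remaining value does not exceed the fee
```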

Predicting future value

Now, instead of ordering the values of articles from highest to lowest, let's say that the values of the articles our reader reads are drawn independently from a probability distribution. Let the value of an article be a random variable $latex V \sim N\left ( \mu,\: \sigma^2 \right )$ with a normal distribution and $latex \mu$ the average value of an article. $latex V_1, V_2, V_3,\cdots$ are the value of the first article read, second article read, etc. Let the period of time for which the reader pays be represented as $latex \left [ 0,1 \right ]$, and the moment when the reader has read $latex n$ free articles and must choose whether or not to pay the fee be at time $latex t\in \left [ 0,1 \right ]$. Assume the reader reads articles at some constant rate $latex r$ throughout the entire period. Then $latex t= \frac{n}{r}$. Now the reader must predict what the value of articles he will read will be to determine whether or not he should pay the fee. Up to point $latex t$, he has gotten value $latex \sum_{i=1}^{n}{V_i}$ and average value per article of $latex \overline{V}= \frac{\sum_{i=1}^{n}{V_i}}{n}$. $latex \overline{V}$ is also the sample mean of the distribution.

Result

Our reader will choose to pay the fee if $latex ( 1-t ) r \frac{\sum_{i=1}^{n}{V_i}}{n} > F$. Since $latex t = \frac{n}{r}$, the left side equals $latex (r-n)\overline{V}$, so the highest $latex F$ the reader will accept goes up as $latex r$ goes up and goes down as $latex n$ goes up. There are some interesting suggestions from this. When the New York Times imposes the paywall, they should carefully monitor the rate at which people read its articles. Those who have a low rate would be ideally suited for targeted discounts. Also, since readers make their predictions based on past articles they've read, the ideal time to convert non-paying readers is right after a reader reads a series of good articles. If the Times can be subtle about dialing $latex n$ up and down, then they can exploit variance in article value to increase sales.
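The decision rule can be sketched in a few lines of Python. This is only an illustration: the function names and numbers are made up, and it uses the identity that (1 - t) times r equals r - n when t = n/r, so the expected value of the remaining articles is the number left to read times the sample mean so far.

```python
from statistics import mean


def max_acceptable_fee(observed_values, rate):
    """Highest fee F the reader would accept, given V_1..V_n and reading rate r."""
    n = len(observed_values)
    # (1 - t) * r with t = n / r is simply the number of articles left to read.
    return (rate - n) * mean(observed_values)


def will_pay(observed_values, rate, fee):
    """True if the predicted value of the remaining articles exceeds the fee."""
    return max_acceptable_fee(observed_values, rate) > fee


so_far = [2, 4, 3]                           # values of the n = 3 free articles read
print(max_acceptable_fee(so_far, rate=30))   # 27 articles left at an average value of 3
print(will_pay(so_far, rate=30, fee=80))
print(will_pay(so_far, rate=10, fee=80))     # a slow reader has far less left to gain
```

Raising the values in `so_far` raises the acceptable fee, which is the sense in which a run of good articles makes a reader easier to convert.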

Further work

This analysis is of course still incomplete. Problems I still see with it:

Knowing that you'll only get a limited number of articles for free will change a reader's behavior. If they're still uncertain about whether or not paying the fee will be worth it, they will more carefully pick which articles they read before time $latex t$. This will bias $latex \overline{V}$ upwards, but push $latex r$ downwards. At time $latex t$, there will also be a backlog of articles that would have been read but weren't, which will influence the decision of whether to pay $latex F$ or not.
How will the reader decide whether or not to read an article before time $latex t$? He'll have to depend on the headline and a summary if available to make a prediction. Before actually reading the article, the reader will predict some value $latex V_{i}'$ and after reading the article realize some value $latex V_i$. This average spread $latex \frac{\sum_{i=1}^{m}{V_i-V_{i}'}}{m}$ will likely affect predictions of future value.
As is, the model says decreasing $latex n$ and increasing $latex F$ leaves the reader's decision of whether to buy unchanged. But as $latex n\rightarrow 0$ this becomes a strict paywall, which the gut says people would be less willing to pay for. Another factor in the reader's decision of whether or not to pay is their confidence about their decision. The larger $latex n$ is the more confident they will be about their value prediction since the sample mean's standard deviation will fall, as $latex \overline{V} \sim N\left ( \mu,\: \frac{\sigma^2}{n} \right )$.
Paywalls, as described by the New York Times and as currently implemented by the Financial Times and WSJ, are easily bypassed. This can be done either by spoofing the referrer header, or by clearing cookies. This avoidance could also be modeled in some way.
Letting people in for free if they come via social media or links from other sites screws everything up. I think this may turn out to be such a huge gaping hole in the paywall that they severely restrict it, but if they don't there are several ways it can be modeled. You could divide articles between different distributions of those that are primarily found through social media and those that aren't. The reader would choose whether or not to pay based on the value of those that aren't. Alternately, an article's ability to be found through social media could just affect its $latex V_i$.
Print subscribers get free access as well. In Salmon's post he looks at $latex P-F$, the difference between the print subscriber's fee and the online subscriber's. If this is less than the value of getting the print paper then the reader will choose the print subscription.
What if users can choose between a short period, and a longer period with a discount? What does the renewal decision look like?

There are undoubtedly more things that can be done with this model. One of the most obvious is to try and figure out what $latex n$ and $latex F$ should be set to.

Finding good values for F and n

Since its readers will not all have the same distribution for $latex V$, it would be theoretically ideal to pick values for $latex n$ and $latex F$ individually for every reader. Realistically, the New York Times probably shouldn't be that opaque about their pricing as it would cause confusion and a negative reaction among readers. If forced to pick a single price, it would be necessary to find the average value of articles for all readers. That's what Stray did with his paywall simulation. However, part of the reason that simulation has such wild swings in revenue from relatively small changes is that many of the variables are dependent on each other. For example, the percentage of people who pay for a subscription does not stay constant when $latex n$ or $latex F$ change. I'll tackle this issue more in my next post.

Special Bonus! A pricing algorithm for the FT

This part might still be a bit half baked, but working backwards from the consumer's decision, it seems possible to figure out a demand curve for each individual piece of content if enough data is available. Since the Financial Times already has a metered subscription plan, if they've been good about collecting user data they should have what's necessary to do this. Here's an outline of the method. It requires some change of notation from the above. Let $latex a_i \in A \;\forall i\in\mathbb{N}$ be an article, and $latex x_i \in X \;\forall i\in\mathbb{N}$ be a reader. We will now represent the value of an article to a reader as a mapping $latex V: A\times X \mapsto \mathbb{R}$ with $latex V(a_i,x_i)$ representing the value of article $latex a_i$ to reader $latex x_i$. The functions $latex F(x_i)$ and $latex r(x_i)$ replace $latex F$ and $latex r$ as the fee and rate for reader $latex x_i$. $latex n$ is as before. Define the set $latex R(x_i)$ such that $latex a_i \in R(x_i) $ iff $latex x_i$ reads $latex a_i$ before deciding whether or not to buy. So our former inequality $latex \left ( 1-t \right ) r \frac{\sum_{i=1}^{n}{V_i}}{n} > F $ becomes $latex \left ( r(x_i)-n \right ) \frac{\sum_{a_i \in R(x_i)}{V(a_i,x_i)}}{n} > F(x_i) $. Rearranging, we get $latex \frac{\sum_{a_i \in R(x_i)}{V(a_i,x_i)}}{n} > \frac{F(x_i)}{r(x_i)-n} $. The left side of the above inequality is the average value of an article that a reader reads before making the buying decision. So if $latex x_i$ does buy a subscription, we then know that the average value was at least the right side. Now that we have an estimate of a given reader's average value for content, we want to estimate that value across all readers. For any given piece of content, some fixed $latex a_i$, to determine its value we sum the average value for content of all readers who read $latex a_i$ before purchasing, and then divide by the total number of readers (who aren't already subscribers) who've read $latex a_i$.
Define, [latex size="2"]\overline{V(x_i)}=\begin{Bmatrix}\frac{F(x_i)}{r(x_i)-n} &,\:if\: \frac{\sum_{a_i \in R(x_i)}{V(a_i,x_i)}}{n} > \frac{F(x_i)}{r(x_i)-n}\\ 0 &,\:if\: \frac{\sum_{a_i \in R(x_i)}{V(a_i,x_i)}}{n} \leq \frac{F(x_i)}{r(x_i)-n}\end{Bmatrix}[/latex]. Equivalently, [latex size="2"]\overline{V(x_i)}=\begin{Bmatrix}\frac{F(x_i)}{r(x_i)-n} &,\;\text{if x buys}\\ 0 &,\;\text{if x does not buy}\end{Bmatrix}[/latex]. This function $latex \overline{V(x_i)}$ is an estimator of the average value $latex x_i$ places on an article. Now define the set $latex S(a_i)$ such that $latex x_i \in S(a_i) $ iff $latex x_i$ reads $latex a_i$ before deciding whether or not to buy a subscription. This set is all non-subscribing readers that read article $latex a_i$ in the current period, whether or not they've ultimately paid for a subscription by the end of the period. If we take $latex \overline{V(x_i)}$ for each $latex x_i$ in the set $latex S(a_i)$, we have a distribution of estimated values for article $latex a_i$. That might look something like this. Finally, to come up with a set value for a specific piece of content, we sum over the entire set and divide by the number of readers. $latex P(a_i)= \frac{\sum_{x_i \in S(a_i)}{\overline{V(x_i)}}}{\left | S(a_i) \right |} $ With this value, you can now derive a demand curve for the entire site. Or you can dynamically set prices based on what articles a reader has viewed before hitting the paywall. Exciting stuff, if actually implemented. If you think I've screwed up the math in some way, or if anything isn't clear, please please let me know. The thoughts in this post are still very much a work in progress.
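To make the estimator concrete, here is a minimal Python sketch of the calculation for a single article. The record layout, function names and numbers are all hypothetical, and it assumes every reader's rate is larger than the free allowance so the denominator stays positive.

```python
def reader_estimate(fee, rate, n, bought):
    """V-bar(x): lower bound on a buyer's average article value, 0 for a non-buyer."""
    return fee / (rate - n) if bought else 0.0


def article_price(readers, n):
    """P(a): the mean of V-bar(x) over S(a), the pre-purchase readers of one article."""
    estimates = [reader_estimate(r["fee"], r["rate"], n, r["bought"]) for r in readers]
    return sum(estimates) / len(estimates)


# Made-up readers who all saw the same article before hitting the paywall.
readers_of_article = [
    {"fee": 20.0, "rate": 50, "bought": True},   # 20 / (50 - 10) = 0.5
    {"fee": 20.0, "rate": 30, "bought": True},   # 20 / (30 - 10) = 1.0
    {"fee": 20.0, "rate": 15, "bought": False},  # did not buy, contributes 0
]

price = article_price(readers_of_article, n=10)
print(price)
```

Because non-buyers contribute zero and buyers contribute only a lower bound, the result is a conservative estimate of the article's value.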

Social Over Search Could Monetize Content

Albert Sun — Sun, 03 Jan 2010 20:30:23 -0500

Hope everyone has had a good holiday! Did you buy gifts for people? Or receive them? From a pure economic efficiency perspective you shouldn't spend a second on holiday shopping and should just give those around you cold hard cash. After all, lump sum transfers are most efficient in redistributing wealth, and there's no way you could know a recipient's preferences better than they do. (Exceptions for parents who've received letters to Santa, of course.) Unfortunately this is empirically very unpopular (in the US at least). As Greg Mankiw explains, it's about signaling. Social interactions aren't just about exchanges of economic value, they're an intricate dance of signaling to others, posturing to others, projecting an image, fitting in with the cool crowd, and all manner of things a high-schooler would be embarrassed to admit to. The content we consume becomes part of that intricate social game of posturing and positioning that we all play to navigate our social spheres. And luckily for content publishers, this all can be exploited.

Social Content

2009 may be remembered as the year that Facebook and Twitter topped the search engines in sites' referrer logs. Increasingly, people are turning back to their friends and acquaintances to point them to accurate information, interesting news and entertainment. Search has become the victim of spam and gaming of algorithms, leading to online social spaces becoming a more reliable way of finding what is good online. Fred Wilson gives a clear summary. For me, social has become the definitive content discovery mechanism. As this happens, content begins to look less and less like a commodity. In search engine land, content is ruthlessly and algorithmically ranked and rated in value and relevance. No one has a personal relationship with a search engine. Information with social context becomes part of our relationships with other people. And no one is better at persuading us to do things we might otherwise not do than our friends. For more evidence that social interactions destroy rational economic thinking, one need look no further than luxury goods that people consume and enjoy more simply because they are expensive and can signal to others that they are of high status. Demand for these goods actually rises as they become more expensive. Or take the vicissitudes of fashion, which require people to swap out their wardrobes on a continuous basis in order to convey their social status to others.

Applications

How to tap into this economic irrationality? Get people to consume goods in a social space that displays their consumption to others. People do this on the subway by holding open the covers of their magazine or book. People do it online by sharing links to their favorite sources and stories. More specifically, publishers could erect a paywall on a site that takes advantage of these social tendencies by giving paying subscribers the ability to share content by whisking their friends and followers past the paywall. With customized links, or a custom link shortening tool that meters the number who use the link, it would be possible to set up multiple tiers of sharing and charge more to let someone share with more people. This also lets people band together and buy subscriptions. It sounds bad for the bottom line, but in the economic literature, it's been shown that allowing this kind of sharing can actually improve profits for a content distributor. What causes this counterintuitive result? In a paper by Bakos, Brynjolfsson and Lichtman, they explain that this kind of social sharing works to aggregate widely distributed consumer valuations for content and present a more favorably shaped demand curve to the producer. Content producers continue to struggle with monetizing their products online, and it looks increasingly unlikely that advertising will be the solution because content sites operate at the level of "intent generation" not "intent harvesting". Perhaps by making the consumption of content more of a social experience, producers will have more success.

Murdoch's Google Bluff/Threat/Stroke of Genius

Albert Sun — Fri, 04 Dec 2009 08:01:43 -0500

I've so far refrained from commenting on the Rupert Murdoch de-indexing comment and ensuing brouhaha. But Google's recent policy change throws everything into the air. Whether you like or dislike him, it's time to stand up and recognize that Murdoch's threat to pull News Corp sites from Google's index has worked brilliantly. Publisher unrest and the threat of a Bing deal and serious search engine competition on site indexing have pushed Google into a major concession. Google's change to its First Click Free guideline is a bigger deal than many people realize. What appears to be a simple change in degree is actually a change in kind. Google has now said that it's okay with sites showing different content to its crawler than to a human following a search results link. There is no longer a guarantee that what shows on a search results page will actually be on the destination page. The Google search user experience will suffer slightly and publishers will now find it much easier to run a pay site.

A Little Background on First Click Free

I've seen First Click Free described by some bloggers as a Google "program" or "service". It's neither. It's more accurate to call it a guideline or policy. Google has always taken a strict stance against "cloaking", or showing different content to its crawler than to a human visitor. What First Click Free said to publishers of pay sites was, in effect, that they had three options. (1) Opt out of indexing altogether. (2) Let the crawler index all content, but direct a human reader to a sign-up page, and risk the wrath of Google, which could include de-indexing or ranking penalties. (3) Implement First Click Free: check all incoming requests to see if they come from the Googlebot or carry a Google referrer, and show those visitors the content for free. (This is what WSJ.com implemented.) Now, with "First Five Clicks Free", Google has given sites permission to stop showing a user the same content the Googlebot sees once that user has passed five clicks in a day.
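Option (3) and the new five-click cap amount to a small decision rule on each incoming request. Here is a rough sketch in Python, not a description of how WSJ.com or Google actually implement it. The function names, the return values and the way free clicks are counted are assumptions, and the user-agent check is deliberately naive (a real site would have to verify that a request truly comes from Google's crawler, since the header is trivial to spoof).

```python
from urllib.parse import urlparse


def is_google_referrer(referrer):
    """True if the referrer URL points at google.com or one of its subdomains."""
    host = urlparse(referrer).hostname or ""
    return host == "google.com" or host.endswith(".google.com")


def decide_response(user_agent, referrer, is_subscriber,
                    free_clicks_used, free_click_limit=5):
    """Return "full" to serve the whole article or "signup" for the pay page.

    free_click_limit=None models the original First Click Free (no cap);
    the default of 5 models "First Five Clicks Free".
    """
    if is_subscriber:
        return "full"
    # Naive crawler detection by user-agent string (assumption, easily spoofed).
    if "googlebot" in user_agent.lower():
        return "full"
    if is_google_referrer(referrer):
        if free_click_limit is None or free_clicks_used < free_click_limit:
            return "full"
    return "signup"
```

Under this sketch a non-subscriber arriving from a Google results page gets the full story for the first five clicks of the day and the sign-up page after that, while the crawler always sees the full text.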

First Click Free Created the Leaky Paywall

I had never understood the complaints about search engines "stealing" content that emanated from the top of News Corp. If anything, search engines were providing free advertising and new visitors to convert to paying subscribers. Pulling their sites from Google's index wouldn't hurt Google, and it wouldn't help the sites either. I thought that the Journal was allowing visitors from Google past the paywall voluntarily to increase traffic. Now it's clear that the Journal was choosing between maintaining the loophole and violating Google's rule against cloaking, which would risk losing Google-derived traffic. In that context, the ire directed at Google makes much more sense. Editors and staff at the WSJ are well aware of both the power of Google to drive traffic and visitors to the site, and the degree to which people were using it to circumvent their paywall. Every morning, an email report goes out to editors and staff detailing what search keywords were driving traffic to the site and what stories and trends are hot online. During my internship, compiling and writing this email report was one of my responsibilities. Searches for the exact headlines of Journal stories often ranked among the top sources of Google referrer traffic. That Google has reacted so clearly and quickly means that some negotiating power is returning to the big publishers. Five free clicks per day is still probably too many to make them happy, though. But the more search share Bing gains, the more leverage publishers will have.

Predictions for the Future

I predict that we will soon see a future where major publishers let search engines see and index the full text of a story, but show just a teaser and a "Purchase" button to users. In fact, paywalled sites could try it now, if they feel like playing chicken with Google. Would Google actually follow through and penalize a site's ranking or de-index it? Especially for a site like WSJ.com, it's plausible that doing so would noticeably hurt the quality of web and news search results. If Bing doesn't penalize a site for doing so, will its results look better in comparison? By explicitly ignoring Google's guidelines, publishers would throw the ball back into Google's court to see how it responds. "First Five Clicks" is a sign that Google may cave on this. My advice to Rupert Murdoch would be to patch that hole in the WSJ.com paywall (give away maybe one free click per day) and see what Google does.