Idea for my next iPhone App: Duplicates in your Address Book

March 19, 2011 by Thorsten · Leave a Comment
Filed under: DataQuality, iPhone 

My tax app “SmarterSteuer” has been in the iTunes App Store for almost a month now. Of course, the sales are not even remotely close to allowing me to quit my day job – at 1.59€ and about 25 sold apps I haven’t even recovered the cost of my membership in the Apple development program. Nonetheless, I’ve been impressed with the number of sales and with how level they have been from day to day: The tax app is limited to a certain type of user (freelancers who have recently started and are trying to figure out their taxes) and it’s only applicable to Germany (and only available in the German App Store). So I’ve been trying to guesstimate the market size for a general purpose app that would be sold worldwide.

Guessing a Market Size

I’ve been trying to find out how many iPhones have been sold worldwide and in Germany. Apparently, Apple releases quarterly numbers (compiled in a nice sales graph by Wikipedia), which would put the number of iPhone sales worldwide at about 90 million at the end of 2010. I couldn’t find any matching numbers for Germany; the latest number I could find was 1.5 million at the end of 2009, when the worldwide number was at about 40 million. That would put the worldwide market at about 27 times the German market, which is consistent with the 4% market share that AdMob is estimating for Germany. Let’s work with a nice round factor of 25.

It’s even trickier to estimate what part of the current German market consists of freelancers. In the general German population you have roughly 1 million freelancers (with 81 million Germans in total). I’m assuming that the percentage is much higher among iPhone owners. Just for argument’s sake, I’m working with a share of 10% freelancers in the whole iPhone market in Germany.

If we multiply these two numbers (a factor of 25 for worldwide versus Germany and a factor of 10 for all iPhone owners versus freelancers only), we come up with a factor of 250 – i.e. the market size of a general purpose worldwide app is about 250 times as large as the market that I’m currently addressing with my tax app. This may not translate to 250 times as many sales or 250 times the revenue, but even 100 times the sales of my current app would be quite impressive.
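The back-of-the-envelope estimate can be written down as a few lines of Python. This is only a sketch of the arithmetic using the figures quoted above; the variable names are illustrative and the 10% freelancer share is the same guess as in the text.

```python
# Figures quoted in the post (millions of iPhones sold by the end of 2009).
WORLDWIDE_END_2009 = 40.0
GERMANY_END_2009 = 1.5

# Worldwide market relative to Germany: 40 / 1.5 is roughly 27.
raw_country_factor = WORLDWIDE_END_2009 / GERMANY_END_2009

# The post rounds this to 25, which matches a 4% market share (1 / 0.04).
country_factor = 25

# Guessed share of freelancers among German iPhone owners.
freelancer_share = 0.10
# Going from "freelancers only" to "all owners" multiplies the market by 1 / share.
audience_factor = 1 / freelancer_share

market_factor = country_factor * audience_factor
print(round(raw_country_factor, 1), market_factor)
```

Changing the guessed freelancer share is the quickest way to see how sensitive the overall factor is to that single assumption.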

I’m not sure if these numbers are reasonable (please let me know if you have better numbers or estimates) – but it’s certainly encouraging enough to contemplate writing a general purpose app that can be used worldwide.

App Idea

I’ve got a few ideas that come close to the description of a general purpose app with a worldwide appeal, but the one that fits best is an app that analyzes the iPhone’s Address Book and looks for duplicates (either created because of synching problems or by manually entering the same person twice). So here is a short description of that idea:

Level 1:

  • look in your address book for potential duplicates
  • visualize the “duplication” in a nice way so the user has a chance to follow up on the results by manually editing/deleting the records

Level 2:

  • offer a way to “automagically” delete complete duplicates (i.e. records identical in all fields)
  • offer a way to merge records (i.e. if the names are the same but each address is different create a record with both addresses so the user can manually edit the “survivor”)

Level 3:

  • add-on services such as address validation or other ideas that come up in the meantime

I think that this may be a valuable app, something that people are willing to pay for. I’m not sure about the effort required to develop such an app, but I’m ready to start. Most of the technology should be pretty straightforward (accessing the address book, comparing records/strings etc.). I’m going to mull this over for a few more days, but I think I’ll get started as soon as I have some time available.

Looking for duplicates: Results of a simple algorithm

February 10, 2011 by Thorsten · Leave a Comment
Filed under: DataQuality 

As a little side project, I’ve been working on analyzing the finishing times in Ironman Triathlons. (If you’re interested, please head over to my Triathlon Rating site.) This involves getting race results and trying to match the athlete names in order to figure out which results belong to the same athlete.

Finding Candidates

In a first, relatively simple implementation, I’ve used Excel to group results from athletes with exactly matching names and only corrected some obvious issues. One example is German umlauts (äöü). There is an athlete named “Mühlbauer”, and there are a number of different spellings (Muehlbauer, Muhlbauer, other strange representations that seem to indicate encoding problems like M¨lbauer). Another typical issue is abbreviations of first names (Timothy DeBoom and Tim DeBoom).

However, this was a completely manual process, and while it’s next to impossible to build a completely automated solution, I wanted some automated help. So I’ve built a simple implementation looking for duplicates within my data that performs the following checks:

  • Is one athlete’s name a substring of another athlete’s name? (Example: “Osborne, Steven” and “Osborne, Steve”)
  • Are two athletes’ names very similar (using the Levenshtein distance)? (Example: “Csomor, Erica” and “Csomor, Erika”)

Both checks are performed matching firstname and lastname with each other and “crosswise” (firstname with lastname and lastname with firstname). (Example: “Gilles, Reboul” and “Reboul, Gilles”) I should add that checks using just these two fields will not be sufficient for a typical business scenario, where other fields have to be taken into account (for example birth date or address).
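The two checks and the crosswise comparison could be sketched in Python roughly as follows. This is not the original implementation: the function names, the case-insensitive comparison and the distance threshold of 1 are assumptions made for illustration.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similar(x: str, y: str, max_distance: int = 1) -> bool:
    """True if one name contains the other or they differ by few edits."""
    x, y = x.lower(), y.lower()
    return x in y or y in x or levenshtein(x, y) <= max_distance


def is_candidate(p, q, max_distance: int = 1) -> bool:
    """Compare two (lastname, firstname) tuples directly and crosswise."""
    (last_p, first_p), (last_q, first_q) = p, q
    direct = (similar(last_p, last_q, max_distance)
              and similar(first_p, first_q, max_distance))
    crosswise = (similar(last_p, first_q, max_distance)
                 and similar(first_p, last_q, max_distance))
    return direct or crosswise


def find_candidates(athletes, max_distance: int = 1):
    """Return all pairs of athletes that look like possible duplicates."""
    return [(p, q) for p, q in combinations(athletes, 2)
            if is_candidate(p, q, max_distance)]
```

Called with a list such as `[("Osborne", "Steven"), ("Osborne", "Steve")]`, `find_candidates` returns the pairs that deserve a manual look. Comparing every pair is quadratic in the number of records, which is fine for a few hundred athletes but would need some form of pre-grouping for larger data sets.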

After implementing these relatively simple checks, I found a number of “pairs”, ranging from pretty obvious to borderline cases (Chris Brown and Christopher Brown could be the same person, but could also be two different athletes). All in all, I was able to identify 11 pairs that are in fact duplicates, representing 1.7% of my 632 athletes. I found this number surprisingly large – I would have guessed that it would be smaller, given the small number of athletes, just one person (me) adding athletes to the database and the manual checks I had already performed. The typical situation in a business would be more conducive to adding duplicates (larger pool of records, a lot of people adding data, users not quite as diligent as I thought I was).

Once the pairs are identified, there has to be a manual step to determine if the pairs are indeed duplicates of one another. This is very hard for an algorithm to do, as there are just too many scenarios to consider.

Survivorship

Once the duplicates are determined, the issue of survivorship comes up – i.e. identifying the “best” record that should be used in the future. In a business context, there are some automatic steps that can be performed (for example collecting the different fields that are filled in the records). When having to decide between different values, there may be some more help available for limited areas (for example when identifying valid addresses). But typically, when deciding which value is right, a human has to be involved (Erica or Erika?).
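The automatic part of survivorship (collecting the filled fields) and the part that needs a human (choosing between conflicting values) can be separated as in the following Python sketch. The record layout as plain dictionaries and the function name `merge_records` are assumptions for illustration only.

```python
def merge_records(records):
    """Collect filled fields from duplicate records and report conflicts.

    Returns (survivor, conflicts): survivor holds every field that has
    exactly one distinct non-empty value across the records; conflicts
    maps a field name to the distinct values a human has to choose from.
    """
    survivor, conflicts = {}, {}
    fields = {name for record in records for name in record}
    for name in sorted(fields):
        values = []
        for record in records:
            value = record.get(name)
            # Skip empty values and values that were already seen.
            if value not in (None, "") and value not in values:
                values.append(value)
        if len(values) == 1:
            survivor[name] = values[0]
        elif len(values) > 1:
            conflicts[name] = values
    return survivor, conflicts
```

For two records that agree on the last name, differ in the first name (“Erica” versus “Erika”) and where only one has a city filled in, the survivor gets the last name and the city, while the first name ends up in `conflicts` for manual review.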

What to do with the Duplicates?

Once the survivor is clear, what to do with the duplicates is still open. In my case, I could just change the race results to point to the “right” athlete record and the other record could be deleted. In business cases, this may be a bit more difficult to do. For example, changing the owner of an account in a banking system is quite an involved procedure (and may even involve some communication with the customer). Also, deleting partners may not always be possible – in the banking example, you probably want to preserve the fact that for a time some other “record” was the owner of an account. In these cases, the minimum you want to do is to “mark” the duplicates in a certain way (so that they are not used anew) and you also want to point to the survivor (so people know which is the right partner to use, and some systems may provide a unified view of these two partners).
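One possible shape for “mark the duplicate and point to the survivor” is a link field on the record itself. The following Python sketch is hypothetical: the `Partner` class, its fields and the helper functions are illustrative names, not taken from any particular system.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Partner:
    partner_id: int
    name: str
    # Set once this record has been identified as a duplicate; it then
    # points to the record that should be used instead.
    survivor_id: Optional[int] = None

    @property
    def is_duplicate(self) -> bool:
        return self.survivor_id is not None


def mark_duplicate(partners: Dict[int, Partner],
                   duplicate_id: int, survivor_id: int) -> None:
    """Keep the duplicate record, but flag it and link it to its survivor."""
    if duplicate_id == survivor_id:
        raise ValueError("a record cannot be its own survivor")
    partners[duplicate_id].survivor_id = survivor_id


def resolve(partners: Dict[int, Partner], partner_id: int) -> Partner:
    """Follow survivor links until reaching a record that is not a duplicate."""
    seen = set()
    current = partners[partner_id]
    while current.is_duplicate:
        if current.partner_id in seen:
            raise ValueError("cycle in survivor links")
        seen.add(current.partner_id)
        current = partners[current.survivor_id]
    return current
```

Nothing is deleted in this design, so the history (which record once owned what) is preserved, and any lookup can call `resolve` first to land on the record that should actually be used.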

Most of the time, correctly updating the system to reflect that duplicates and survivors have been identified is a highly manual process, especially if there is a complicated IT infrastructure with a number of different applications involved. I’m not aware of any general tools that help with this.

Summary

Using my example, a relatively simple algorithm already provided good results in identifying duplicate candidates. The percentage of 1.7% that I found in my data is probably a low estimate for bigger, commercial data. I was able to deal with the candidates pretty easily; in a commercial environment this would have been a lot more complicated.

Steuer App: Almost MVP

February 1, 2011 by Thorsten · Leave a Comment
Filed under: iPhone 

After working on my little side project TriRating over the holidays, I’ve gone back to my iPhone app.

It’s close to being ready to be a “Minimum Viable Product” (MVP):

image

Compared to my last blog post, I’ve changed the colors of the interface with a little help from Zsolt Markus (whom I contracted through oDesk), but mainly I made the interface work correctly. I think there is very little left before I’m going to release it (little things like formatting of the numbers and remembering the last values when the app restarts).

It’s not yet the cool app I had in mind, but it’s looking good enough not to be embarrassed by it and I just want to get the app out (and hopefully get some feedback). I’ll play around a bit more to work on gestures and ease of input. For now, I’ll also have to work on the non-coding aspects (e.g. companion website, sales copy on iTunes, support infrastructure).

Steuer V0.2 – making progress with my iPhone app

December 28, 2010 by Thorsten · Leave a Comment
Filed under: English, iPhone 

Over my last few train rides I’ve managed to make some progress in making my app a bit nicer.

This is what the old version looked like:

image_thumb2

Apart from a few “under the hood” things (like moving to a new version of Xcode and the SDK, so the simulator looks like an iPhone 4 now), I’ve improved on the following items:

  • switched to a table-based view
    This makes the inputs (Eingaben) and results (Ergebnisse) pretty self-evident. I’m still going to tinker a bit with the view as there is one row that wouldn’t quite fit on one screen and I don’t want the app to start scrolling.
  • added a “title bar” (called navigationItem in iPhone-speak)
    I couldn’t quite figure out how to do that for the simple version where I didn’t have to write a controller for the main app screen.
  • ditched the “Berechnen” button
    After changing the inputs, the results are automatically calculated.

Here’s where I am now:

image

I think that is looking a lot better than V0.1, but there is still a lot to do – both from a design (colors anyone?) and a functionality standpoint, but at least I’m making some progress. I’m hoping to get a bit more done in the “free” days after Christmas, but it already looks like a lot of days are filled with visits to family and friends, and that is certainly important, too. I’ll keep you posted!

Unit Testing an iPhone App – Not important to Apple?

December 7, 2010 by Thorsten · Leave a Comment
Filed under: English, iPhone 

After the base functionality of my iPhone app was in place, I wanted to have a closer look into unit testing. I try to use unit testing where it makes sense when developing for my customers, so I wanted to do that for my own app as well. I was encouraged that there was some documentation in the Apple developer’s guide, and I was really looking forward to putting the framework to use.

Restructuring my App for testability

Before being able to properly test the tax calculation functionality, I had a bit of restructuring to do in order to isolate the functionality I wanted to test and to separate it from the GUI. I pulled the tax calculation into its own class and made the GUI use this TaxCalculator. This separation of concerns is one of the benefits of using a Test Driven Approach – it forces you to think about testability and separation of concerns if you want to have an easily testable class. The changes were totally unnoticeable to the user, but added some value because the code is now much easier to understand and to maintain. It took me less than an hour of development time, and that was time well spent.
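The idea of pulling the calculation out of the GUI can be illustrated with a small Python sketch (the app itself was written with the iPhone SDK, so this is an analogy, not the app’s code). The class name TaxCalculator comes from the text; the method name, the injected `tariff` function and the flat 30% placeholder rate used in the test are assumptions.

```python
import unittest


class TaxCalculator:
    """Pure calculation logic with no reference to any GUI class."""

    def __init__(self, tariff):
        # tariff: function mapping a taxable amount to the income tax owed.
        self._tariff = tariff

    def tax_owed(self, earnings: float, gwv: float = 0.0) -> float:
        # gwv: monetary equivalent of non-monetary benefits, added to earnings.
        taxable = earnings + gwv
        if taxable < 0:
            raise ValueError("taxable amount must not be negative")
        return self._tariff(taxable)


class TaxCalculatorTest(unittest.TestCase):
    def setUp(self):
        # Placeholder tariff: a flat 30%, chosen only so that the expected
        # values are easy to verify by hand. It is not a real tax formula.
        self.calculator = TaxCalculator(lambda amount: 0.3 * amount)

    def test_gwv_is_added_to_taxable_amount(self):
        self.assertAlmostEqual(self.calculator.tax_owed(1000, gwv=200), 360.0)

    def test_negative_amount_is_rejected(self):
        with self.assertRaises(ValueError):
            self.calculator.tax_owed(-1)
```

Because the calculator knows nothing about views or controllers, the tests can run without starting any user interface, for example with `python -m unittest` on the file containing them.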

Initial Problems with SDK 4.1

The next step was reading up on Unit Testing with the iPhone SDK. There is a chapter in the iOS Development Guide that describes how to add tests to your project. From the start, I was getting strange errors even though the tests seemed to pass. As I was developing these tests while riding the train, I played around a bit in order to avoid the problems, but didn’t have any success. I added a few tests for the TaxCalculator, but ran into some more strange behaviour.

Some Progress with SDK 4.2

When I was back home, I googled for my problems and quickly found that it was a problem with SDK 4.1 (thanks to this Stack Overflow question). There seemed to be a workaround in 4.1, but as the problem was fixed in SDK 4.2, I just decided to upgrade my SDK, which I wanted to do anyway.

After that, I had the next problem – Xcode was telling me that my “base SDK was missing”. Googling for that problem I quickly found an easy fix for this issue as well (the one I used was on John Alexander Rowley’s blog, but there are a number of helpful hints across the net). After that, my tests were finally running without any errors and I was able to focus on writing test cases.

However, I soon ran into a lot of problems in properly checking what the tests were doing and if my app was working fine. I’m usually trying to reserve judgment, but in this case I can’t put it any other way:

What was Apple thinking?

Using the Apple-supplied unit test library, the tests are only run during the build phase. Running tests during the build phase is not a bad idea: it can prevent building an app if some tests are failing. But running them only during the build phase has some serious drawbacks: Messages sent to the log do not show up in the debugger console (although at least they show up in the system console), and more importantly: You can’t debug the tests!! (I should note that there are some descriptions of how to set up a separate target that would allow you to debug the tests, but the procedure is very complicated – like setting the environment variable CFFIXED_USER_HOME to "${HOME}/Library/Application Support/iPhone Simulator/User". I gave up on it after unsuccessfully trying for an hour or so.)

I have no idea how anyone can think that what is currently delivered is good enough to use. Someone who has never used unit tests will not endure the procedure to make it work unless they really want (or have) to use them. It just seems to be some item on a list that Apple wanted to tick (“Unit Testing? We have that.”) but wasn’t really interested in. Very “un-Apple” …

Switching to GHUnit

Luckily, there seems to be a good open source alternative to the Apple Unit Testing framework, named GHUnit (after its developer, Gabriel Handford), which can be found at http://gabriel.github.com/gh-unit/index.html. It was very simple to download and add to my project. Within 15 minutes I had my project use GHUnit and I was able to run the previously developed tests within the simulator:

image

It also enabled me to log to the console and debug my test cases in the Xcode debugger just like any normal app – without having to jump through any hoops. You may argue for a nicer interface (color anyone?), but that is certainly a minor issue. After the disappointing Apple framework, GHUnit delivered on what I was looking for. I’m hoping that the framework will hold up under all the testing I want to throw at it in the next weeks!

Calculating tax owed – My First iPhone App

November 2, 2010 by Thorsten · Leave a Comment
Filed under: English, iPhone 

As indicated in my latest blog posts, I’ve started to dabble in iPhone development. As a starting point, I have programmed a little app that calculates income tax owed based on money earned.

Use Case

The use case for this app is based on my personal situation as a freelancer in Germany. The situation is probably similar to other countries, and I might build similar versions for non-German markets.

As a freelancer in Germany, when you get paid, there is no tax withheld as it would be when you are an employee. At the end of the year, you have to declare your earnings, tax is assessed and then you have to pay. As this is a substantial part of your earnings (30-40% is a typical number here in Germany), you better make sure that you know how much of your earnings is actually money you can spend. It is also a good idea to put away the money you expect to pay for taxes. (I like to save the amount in a money market account, so I even have a chance to earn some income on it, even if there’s not much to make with the way interest rates are these days.)

So the App basically asks you how much money you have earned in a month and shows you how much taxes you would owe on this amount. Sounds really simple, but nothing with taxes is ever that simple:

  1. There are some additional sums that have to be included in the taxable amount. In Germany, the main one is called “Geldwerter Vorteil” or GWV for short – a monetary equivalent for non-monetary benefits that you get. The “classic” example is a company car that you can use privately or for riding to your place of work.
  2. There are different components to income tax, such as the base rate, “Solidaritätszuschlag” (an add-on income tax that was originally introduced to aid East Germany after unification) or “Kirchensteuer” (“church tax” which is collected by the state and then re-distributed to the churches).
  3. The tax rate itself is not one simple formula but has different formulas for different ranges.

So even if the use case is not all that complicated, it is also not trivial to build a decent app for that.
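The three complications listed above can be sketched in Python as follows. The bracket boundaries and marginal rates are made-up placeholders and NOT the German tariff (which uses different formulas per range); the two surcharge rates are commonly cited values that should be verified before any real use, and the function names are illustrative.

```python
# Placeholder brackets: (lower bound, marginal rate above that bound).
# These numbers are invented for illustration and are NOT the German tariff.
PLACEHOLDER_BRACKETS = [(0, 0.0), (8_000, 0.20), (50_000, 0.40)]

# Surcharges expressed as a share of the base income tax (verify before use).
SOLI_RATE = 0.055        # Solidaritätszuschlag
CHURCH_TAX_RATE = 0.09   # Kirchensteuer; the rate depends on the location


def income_tax(taxable: float, brackets=PLACEHOLDER_BRACKETS) -> float:
    """Sum the tax of every bracket that the taxable amount reaches into."""
    tax = 0.0
    for index, (lower, rate) in enumerate(brackets):
        if index + 1 < len(brackets):
            upper = brackets[index + 1][0]
        else:
            upper = float("inf")
        if taxable > lower:
            tax += (min(taxable, upper) - lower) * rate
    return tax


def tax_components(earnings: float, gwv: float = 0.0,
                   church_member: bool = False) -> dict:
    """Break the total tax down into base tax and surcharges."""
    taxable = earnings + gwv          # GWV increases the taxable amount
    base = income_tax(taxable)
    soli = base * SOLI_RATE
    church = base * CHURCH_TAX_RATE if church_member else 0.0
    total = base + soli + church
    return {
        "base": base,
        "soli": soli,
        "church": church,
        "total": total,
        "rate": total / taxable if taxable else 0.0,
        "after_tax": earnings - total,  # GWV is not cash, so it is not added
    }
```

The amounts are assumed to refer to whatever period the brackets are defined for; applying yearly brackets to monthly earnings would require scaling first.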

The App V0.1

Here is a screenshot of the first, simple version of the app:

image

At the top, you can enter the money earned (“Einnahmen”) and the GWV. Below the ‘Berechnen’ (calculate) button, the different components of the tax and the full amount of tax owed (“Gesamt-Steuer”) are shown. In addition, the tax rate and the earnings after tax are displayed.

From a functional viewpoint, there is very little that is missing for the use case I outlined above:

  • Kirchensteuer is dependent on whether you are a member of a church and on your location. There has to be some configuration, probably in a new view.

There are a few things that would be nice to have:

  • The last values should be saved and reloaded after the app start.
  • In order to calculate GWV for car use there are some standard rules. Instead of entering the final value, the way this is calculated should be made available, which means another configuration view.

After that, the improvements are not so obvious:

  • make the interface prettier (icon, title bar, colors)
  • use better formatting of the values (separators, currency symbol)
  • instead of individual labels and fields try out a table-based view (should look “nicer” and also offer an easier way to integrate the configuration views)
  • explore a subview with ads in order to monetize the app (instead of selling it)
  • explore alternatives for smarter entry of values (instead of keyboard, use dials and/or gestures)

The last item is especially important for me, as I want to develop apps that make good use of the iPhone’s capabilities. In my view, today there is not much use for simple data entry iPhone apps that could just be a web form. In order to delight the user, you have to be able to make things better, easier and unique to stand out from the crowd of apps in the app store. This may be tricky for a “boring” tax app, and you also have to find a balance between “clever” and “easy to use”, but I’m willing to explore these areas and see what I can come up with.

I’ll post some news once I’ve improved the “raw” app I have now.

Discussing an iPhone App: Settlers of Catan

October 4, 2010 by Thorsten · Leave a Comment
Filed under: English, iPhone 

Over the last weeks, I’ve been thinking about starting some development for the iPhone. There are some ideas in my head of what I could do, but nothing specific yet. The only thing that I’m sure of is that I don’t want to do another “me too” program, but something that really uses the iPhone’s strengths.

For example, a lot of mobile apps I see just let you enter data on the iPhone using the traditional “keyboard” and “picker” methods. I’m pretty sure that some gestures may be helpful and allow for quicker entry. Also, oftentimes “guessing” from a GPS-location or time-of-day may offer some good default values. I hope that this will be a road that I’ll be able to explore a bit more.

Because of all of this, I’ve been paying closer attention to things I like and dislike about some of the apps and games I’m using.

I like playing the occasional game on the iPhone, and in the last few weeks I was playing a lot of Siedler von Catan (Settlers of Catan in English). Just to be clear: The adaption of the board game is done quite well, and the computer players have different strengths, so the game provides fun and entertainment for a long time. A well-spent 4€ (or 5$)!

However, there are a few issues that I have. The start of the game features an elaborate animation, but I have found no way to get straight into the game. So every time you start the app, after waiting a few seconds for the animation to load, you get the start of the animation:

IMG_0001

After tapping on the screen, you get another splash screen, where the game even tells you to tap the screen (“Bitte Bild berühren”, i.e. “please touch the picture”):

IMG_0007

Then you get the main game menu, where you can hit “Spiel fortsetzen” (“continue game”) to continue the ongoing game:

IMG_0005

That’s three useless taps (and probably 10 seconds of time) that I would consider bad style for an iPhone app. (It’s okay for a PC game, but not for an iPhone app that gets frequently stopped and restarted.)

There are a few other examples where a bit more thinking would have helped create a better experience. For example in this screen you have only three options, but the graphics don’t fit the screen so there may be some scrolling involved:

IMG_0003

With just some more work, the graphics would fit the screen. Also, swiping doesn’t really work here; instead you have to use the provided arrows.

I’m sure that these issues arise from trying to be consistent with other versions of the same game, but I would think that some more care should have been taken while adapting the game to the iPhone.

So five stars for the basic game, but only one star for the iPhone adaption.

Have you built your DQ trust today?

May 10, 2010 by Thorsten · 7 Comments
Filed under: English 

For German readers: There is a German version of this blog post.

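Recently, there have been quite a few posts on using “shame” as a tool for improving data quality. Here are just three off the top of my head:

  • A Data Quality Riot Act by Rob Paller
  • The Poor Data Quality Jar
  • The Scarlet DQ, both by Jim Harris on OCDQBlog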

Picture by Okinawa Soba, taken from flickr with a cc license

I’ve added some comments to these posts saying that I think they are missing something. I wasn’t quite able to put my finger on it, not sure how to grab the “missing thing”, not really able to give it a name. In order to move the discussion forward, I’ve decided to go with “DQ trust” and try to explain my thinking a bit more. Let me know in the comments what you think!

The problem I see with the “public humiliation” aspect of what Rob and Jim are describing: It will only work in a certain environment – when the “riot act” gets what I would call a “wink wink, nudge nudge” aspect.  The “culprit” understands why the reaction is coming, but the whole thing is so much over the top that it can’t really be taken seriously. This results in taking the sting out of the “public humiliation” aspect and the riot act achieves its purpose.

In order for this to work, there has to be one of two things: Either you have to be a really good comedian (and I’m certainly not) so that you can spring this on a person you’ve hardly ever dealt with before. If your act backfires, you’ll also have to deal with that person’s boss, and I have found humor to decline when moving up the corporate ladder. Pretty risky to rely on that.

That leaves the second option: Your riot act has to have a background to it, and you must have built a reputation as a fervent defender of data quality in your organization – you must have built trust in your data quality judgment. This way, a person or his boss can understand that your reaction is aimed at improving data quality, and not at publicly humiliating data quality villains.

Too often I find that people do not take enough time to build this data quality trust. As they say, it takes a long time to build trust, but only a moment to destroy it forever. Here are some ideas of what to do to build the trust:

  • reserve judgment on someone’s actions for as long as possible – try to find out why people do things a certain way before telling them they are idiots
  • admit that you don’t know everything and try to learn constantly by interacting with different people from different departments to get a 360° view on the issues
  • help people to solve their problems – then they will be much more willing to help you when you need their support
  • make sure to explain data quality in terms the person understands – a business user doesn’t care too much about referential integrity unless you can explain how it affects his daily work
  • don’t be too academic in your data quality requirements – it doesn’t make sense to require perfect data quality for data that is never used

Even with this, whenever a new data quality issue comes up and I’m shaking my head, wondering why anyone would come up with this harebrained scheme, I ask myself whether I’ve built enough trust to shame the person about it or not. Almost always, I come out on the side of caution and try to be firm on the issue, but avoid assigning personal blame. In the short term, this may not be quite as satisfying as “venting”, but it has a much better chance of long-term success.

Have you increased the trust in DQM today?

May 10, 2010 by Thorsten · 1 Comment
Filed under: Deutsch 

For English-speaking readers: There is an English version of this post.

Recently, a number of blog posts have been published that discuss a pillory as a means of improving data quality. Here are some English-language examples:

  • A Data Quality Riot Act by Rob Paller
  • The Poor Data Quality Jar
  • The Scarlet DQ, both by Jim Harris on OCDQBlog

      Picture by Okinawa Soba, taken from flickr with a cc license

    With these posts, I had the feeling that the reaction is humanly understandable but tends to harm the cause. To take the discussion further, I would like to explain my thoughts on “trust in DQM” in a bit more detail, and I’m looking forward to comments and other points of view.

    The problem I see with the “public pillorying” in the posts by Rob and Jim: Such an approach will only work under narrow conditions – when the pillorying is done with a wink. It is easiest when the “accused” understands on his own what went wrong – then the “dressing-down” can be so exaggerated that it cannot be taken seriously, so it does not turn into a personal attack, but the hint is still taken.

    For this to work, I see two possibilities: You are so funny that a “dressing-down” of a new colleague leads to a smile – I certainly can’t pull that off myself. It also carries the risk that the whole thing backfires and you then have to discuss your behavior with the colleague’s superior. In my experience, such conversations become more humorless the further up the corporate hierarchy you go. All in all, it is very risky to rely on your comedic talent.

    That leaves the second possibility – the pillorying happens only in exceptional cases and only after you have built trust in DQM and it is clear to everyone involved that the point is to improve data quality and not to pin the blame for problems on someone.

    Unfortunately, people rarely take enough time to build the trust required for this; on the contrary, painstakingly built trust can be gambled away with one careless remark. Here are a few ideas for creating trust in DQM:

    • A judgment should be made as late as possible – you should first try to find out why people chose a certain approach before calling them idiots
    • You can readily admit that you don’t know everything either and that you keep learning by talking to as many people as possible who have different perspectives on a topic
    • Support people in solving their problems – then they are more willing to help with DQ problems as well
    • Data quality has to be explained in a way that depends on the listener – a business department is not interested in referential integrity unless you explain what it means for its daily work
    • Don’t be too academic in your DQ requirements – it makes no sense to demand perfect data quality for data that is hardly ever used

    Even with this basic attitude, I often first have to shake my head at the idiotic things some users come up with. But then I ask myself what reaction I would get if I teased the user a little about it. In most cases, I then proceed a bit more cautiously and discuss the effects of the errors instead of looking for personal blame. In the short term, this is not quite as satisfying as “letting off steam”, but it has much better chances of success in the long term. How do you react in such situations, and what experiences have you had?

    Assessing Data Quality – Improve vs. Maintain

    April 16, 2010 by Thorsten · 1 Comment
    Filed under: English 

    For German readers: There is a German version of this blog post.

    Last week, I was discussing measuring Data Quality with a customer. For a while it seemed we couldn’t agree on anything, until we realized that we were talking about different types of DQ projects:

    1. A project geared at improving data quality in a specific area
    2. An ongoing effort to make sure data quality stays within accepted levels

    image

    Once we talked about these different types, agreement came very easily.

    Improving Data Quality

    In this type of project, there is an important business reason that requires improving the data quality. Typically, you start with a large number of errors and have to reach a much improved level. In some cases, this level has to be zero, but typically a low number (say, 10) of error cases is acceptable. Examples of this type of DQ project include meeting regulatory requirements or the migration of data to another system.

    This is a project in the strict sense: You have to reach your goal by a fixed date. As so often these days, the goal has to be clarified after the project has started. For a data quality project this includes identifying important data areas to be improved, defining rules that the data has to conform to and finding a way of identifying non-complying data. When this step is completed, you end up with a number of DQ Measurands (see my previous post on Describing DQ Measurands) and an automated way of measuring the data quality for each specific measurand. Typical projects I’ve worked on had a list of 20 to about 100 measurands that changed a bit over time, but were relatively stable after the initial definition phase.
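As a rough illustration of “a rule plus an automated way of measuring”, a measurand could be modeled in Python like this. The `Measurand` class, its fields and the `measure` function are hypothetical names for this sketch and do not describe any specific DQ tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Measurand:
    name: str
    rule: Callable[[dict], bool]   # returns True when a record conforms
    acceptable_errors: int = 0     # target level for the improvement project

    def violations(self, records: Iterable[dict]) -> List[dict]:
        """Return the records that do not conform to the rule."""
        return [record for record in records if not self.rule(record)]


def measure(measurands: List[Measurand], records: List[dict]) -> Dict[str, dict]:
    """Run every measurand over the records and report the error counts."""
    report = {}
    for measurand in measurands:
        errors = len(measurand.violations(records))
        report[measurand.name] = {
            "errors": errors,
            "resolved": errors <= measurand.acceptable_errors,
        }
    return report
```

Running `measure` on a regular schedule and storing the reports gives the series of error counts per measurand that is needed to see which issues are resolved and which still need work.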

    The main questions that have to be answered in this type of DQ project are:

    • Which issues have been raised and which have been resolved? Which do we still have to work on?
    • Are we on track to getting to an acceptable level of Data Quality by the end date?

    Maintaining Data Quality

    In contrast to the “Improvement” type of project, a “Maintain” type does not necessarily have an end date but is an ongoing effort. (It may start towards the end of an improvement project when most issues are resolved and should stay that way until the project ends.)

    Most of the definition work has already been done by improvement projects, and the maintain project “inherits” these results. Again, the number of DQ measurands may be quite high – even higher than in an improvement project, as over time the rules of multiple improvement projects move into maintenance. The data quality is usually at an acceptable level, so the types of questions are different:

    • Have there been changes that require action?
    • How well do the rules cover all the data in the organization?

    Assessing DQ Measurements in different types of DQ projects

    A DQ measurand can be defined without having to take into account what type of project it is used for. But interpreting the measurements has to take the project context into account and leads to different interpretations in order to answer the question. My customer and I are still working on the specifics, but identifying the different types of projects helped us gain a shared understanding.
