taowen

Architecture Fundamentalism: Review of My 2016

2017-01-25T01:08:00.003+08:00

I write code every day, so I thought I would never categorize myself as "impractical" or "theoretical". However, after reviewing my learning journey in 2016, I invented a term: "Architecture Fundamentalism".

Architecture Fundamentalism: applying a principle or an architecture without considering the context

That is exactly what I did, in retrospect. It is too hard to say this in my mother tongue, so I am writing my retrospective in English.

I learned the following lessons the hard way.

Idea 1: Adding tests to regain confidence

The idea is simple. Tests are important for agility. If you want people to have confidence and do more refactoring, you should have tests first. So, let's add tests to a big legacy codebase, in the hope that the developers will appreciate them and regain confidence when making changes. Isn't it a good idea? Does it work? Not really...

Idea 2: Integration tests over mocks

Having worked at a company that invented around 10 xxxMock frameworks, I had plenty of bad experiences where a broken build told you nothing except that some mocking behavior had changed. I hate mocks, I do not deny it. Let's embrace the great grand integration test: run all the components within a bounded context, and test them together. The insight is that the boundary between client and server should be stable. If we write the test to simulate the client, it should produce very stable results between releases. What went wrong? A lot of things...

Idea 3: Unify the middleware

The technical stack is a disaster. There are 5 languages, 8 frameworks, and countless libraries being used in production. If we want to do distributed tracing, service discovery, and traffic coloring, shouldn't we unify the stack? Every successful company seems to be backed by one core RPC stack. Shouldn't we do that? Yes, we should. Is it the only way and the best way? Maybe...

Idea 4: Decoupling systems by events

RPC is evil, and fragile. That is what I learnt from Fred George, a funny old thought leader, and from many other big bloggers: don't model your system using RPC. The system should be divided into bounded contexts, and they should be integrated through a messaging system. Loose coupling, something those idiots can never get. It turns out I am the moron who blindly trusted an architecture without proving that it fits the context. There are simply too many unanswered questions when replacing RPC with an event based architecture.

===

All of these ideas are attractive at first look. Now, let me summarize what actually went wrong, and what practical solution (maybe quick & dirty) actually worked.

Experience 1: Small flow in production trumps everything

TDD is dead. Don't get me wrong, testing is a good thing. It is like a daily physical workout, it keeps your body fit. However, what most people want is just a system that is not bleeding all the time. When you actually have a system on the edge of crashing every day, adding tests should NEVER be your first priority. The golden tool in the internet business is small flow in production: release the change to a small share of production traffic first. Nothing is more convincing than seeing it actually work in production.

Aren't we already doing small flow releases every day? It turns out we are not doing enough of them. With everything coupled in one module, you can do at most 4 small flow releases, with 4 hours each to verify that the change actually works. This is not enough, given how many changes are coming into the big ball of mud. The most important result of SOA decoupling is not blah, blah, blahhh. It is that you have more modules that can each do their own small flow release. This way, multiple modules can be tested online at the same time, giving the new code more time to prove itself.
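To make "small flow" concrete, here is a minimal sketch of per-module traffic splitting. It assumes small flow means routing a fixed percentage of users to the new release of one module. The function name, the hashing scheme and the percentages are all hypothetical, not what our gateway actually does.

```python
import hashlib


def in_small_flow(user_id: str, module: str, percent: int) -> bool:
    """Return True if this user should hit the new release of `module`.

    Hashing module and user together gives every module its own,
    independent slice of users, so several modules can run a small
    flow release at the same time.
    """
    digest = hashlib.sha256(f"{module}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return bucket < percent


# The same user always lands in the same bucket for a given module.
first = in_small_flow("user-42", "pricing", 5)
second = in_small_flow("user-42", "pricing", 5)
```

Because the bucket is derived from the module name as well, splitting the big module into smaller ones multiplies the number of releases that can be verified in parallel.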

Does that make tests irrelevant? NO. Testing is still a good practice. Once keeping the system from crashing is no longer your daily business, keeping your developers happy becomes important. Modifying a PHP source file without being able to run it in a test environment, only to find out that a misspelled method call causes a PHP error in production, is NOT a fun experience. Even if we have small flow to contain the damage, the release/rollback process is still time consuming and frustrating. Without a good test suite, you can not expect your developers to be brave enough to do the right thing (I mean actually refactoring the code to fit the business model).

But, let's face the cold fact. It doesn't matter. Small flow in production trumps everything. Everything else comes second.

Experience 2: Single module tests matter

My principle is really simple. The tests should cover the business rules, not the code. The tests should guard the business against financial loss, and ensure the main business process keeps working after every release.

My principle is still rock solid. But it does not mean the tests should ONLY do that. I didn't get it at first. I thought, why are you guys creating tests for a single module and mocking everything around it? I blamed the team structure for that: because the teams are set up in silos, they only care about their own business. So, in the end, no one is responsible for the whole process with all the components assembled together.

Without context, my argument seems right. Let me give you some context. We have such a large PHP codebase, with so many dependencies, that developers can not even run the code up to the point where they made the change. To make the code actually run, from index.php to the point of change, you need to set up more than 10 system dependencies with a lot of configuration data in the database. Given that the code is written in PHP, there is no object model and no compile-time type safety. The only defense between your change and a total disaster is the god damn "small flow in production".

What should you do in this context? Right, run your code before it goes into production. Give those poor boys a run button, even if it only checks that the PHP syntax is correct. What is the easiest way to get the code running? Yes, just run the module under change, and mock everything else. How do we keep the mocks from breaking? If we write the mocks by hand, it will be tedious and fragile. If we capture tracing data from production and replay it in the test environment, it will be cheap to update the mocking behavior.

So, do write tests for a single module. It matters, and matters a lot. I was wrong.
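The record-and-replay idea can be sketched in a few lines. This is an illustration in Python, not our PHP tooling: the trace format, ReplayMock and quote_price are hypothetical names I made up for the example.

```python
import json


class ReplayMock:
    """Answers downstream calls from recorded (request, response) pairs."""

    def __init__(self, recorded_calls):
        self._responses = {}
        for call in recorded_calls:
            key = self._key(call["service"], call["request"])
            self._responses[key] = call["response"]

    @staticmethod
    def _key(service, request):
        # Canonical JSON so that dict ordering does not matter.
        return service + "|" + json.dumps(request, sort_keys=True)

    def call(self, service, request):
        key = self._key(service, request)
        if key not in self._responses:
            raise LookupError("no recorded response for " + key)
        return self._responses[key]


def quote_price(order, downstream):
    """The module under change: it only talks to `downstream`."""
    rate = downstream.call("pricing", {"city": order["city"]})["rate_per_km"]
    return round(rate * order["distance_km"], 2)


# A trace captured from production (hypothetical data).
trace = [
    {"service": "pricing",
     "request": {"city": "beijing"},
     "response": {"rate_per_km": 2.5}},
]
price = quote_price({"city": "beijing", "distance_km": 10}, ReplayMock(trace))
```

When the downstream behavior changes, you refresh the trace instead of rewriting mock code by hand.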

Experience 3: Result oriented thinking

Unifying the language/framework/library is the right thing to do. But you need enough benefit to justify the cost. The core business of the middleware is encoding, decoding, and connection pooling. The added value of an RPC framework is unified logging/metrics, service discovery, and resilience. Having a Thrift framework with everything included is not enough to justify the cost of switching the stack away from http+json. The stability of the core business is more important. By changing from http+json to Thrift we might get the added value we want, but it also means a lot of things need to be re-tested and re-verified in production.

What the boss wants is NOT a unified technical stack. NO one is paying for that. They want a distributed tracing system, so that when a customer complains about a bad experience, we can actually look it up through a web interface. They want to load-test the whole business process in production, which requires coloring the test traffic to tell it apart from normal traffic.

To achieve the result, is unifying the stack the right solution? Not really. Even though we have 5 languages and 8 frameworks to maintain, we can still make the changes to them one by one. Instead of changing one place, we are changing 30 or more places to get the job done. Is it costly? Yes, it is. Is the result delivered? Yes, it is done, sir. Good.

This is result oriented thinking. What can actually justify the cost of unifying the technical stack? I don't know yet. Maybe when the boss actually wants a smaller team, so that fewer people are doing similar stuff? Just maybe.
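The change that has to be repeated in each framework is small: carry a trace id and a traffic color from the incoming request to every outgoing call. A rough sketch, assuming the metadata travels as HTTP headers; the header names and the "load-test" value are hypothetical.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"       # hypothetical header name
COLOR_HEADER = "X-Traffic-Color"  # hypothetical header name


def outgoing_headers(incoming: dict) -> dict:
    """Build the headers for a downstream call from the incoming ones."""
    headers = {}
    # Reuse the caller's trace id, or start a new trace at the edge.
    headers[TRACE_HEADER] = incoming.get(TRACE_HEADER) or uuid.uuid4().hex
    # The color must survive every hop, otherwise test traffic leaks
    # into normal processing further down the chain.
    if COLOR_HEADER in incoming:
        headers[COLOR_HEADER] = incoming[COLOR_HEADER]
    return headers


def is_load_test(headers: dict) -> bool:
    return headers.get(COLOR_HEADER) == "load-test"


colored = outgoing_headers({TRACE_HEADER: "abc", COLOR_HEADER: "load-test"})
fresh = outgoing_headers({})
```

It is the same few lines in 30 places, which is boring and costly, but none of it requires replacing http+json.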

Experience 4: Classical event based architecture is impractical

I love event based architecture. I love decoupling. But preaching does not work. There are very practical reasons why classical event based architecture is actually unfit in this context. We should listen first, not blindly impose our will. Here is the list of reasons why we should stick to RPC:

Our business is a realtime business. Delaying an event in a queue and making it eventually consistent is just not acceptable. Delivering the event to the next system is mission critical to the business. If it is not delivered, we should try our best to re-route it and keep the main process flowing. Queuing makes it harder for the upstream to make sure the delivery happens, and happens fast.
The client side is expecting a synchronous response. When an order is finished, how much it costs is part of the response and is displayed on the screen. It is not acceptable to do the calculation asynchronously, which would require a big API/interaction change. It is just impractical to change every UI interaction to be async.
Lack of tooling support. Event based architecture requires a lot of tooling to make it feasible. Without good distributed tracing, when no message flows out, it is very hard to track down which link dropped or missed it. If the message is passed via RPC, we can be sure there is an error log somewhere. And most importantly, every developer knows RPC, and knows to look at those error logs.

These are very practical reasons. I found that all the benefits promised by the grand async architecture simply can not justify the cost. I tried to use Kafka somewhere in the system, but could not find enough applicable places to make a difference. The main business process is still a big ball of mud, and it seems you can not decouple it with any messaging system without changing the business requirements. Which is really, really frustrating.

Mimicking RPC over a duplex messaging channel seems like a good idea. And we actually tried it. The project finally got cancelled, and I will never try that again. It was a total disaster.

What actually worked? Well, it turns out this is what the system ended up as:

do RPC and return the result synchronously, on normal days
when the downstream malfunctions, return a fake result with a disclaimer, so that the end user will know (or they simply do not care)
store the failed message in an async queue
the queue keeps retrying the process, until everything has recovered

No one designed the system to work this way. After patching and patching, it evolved into this form by itself. It actually reveals an important paradigm. It is still an event based architecture at heart. Compared to the classical design, the upstream and downstream are not completely decoupled. In the normal flow, RPC is invoked and the result can be returned. When bad things happen, we downgrade the system into async mode. Through reliable event recording and message replaying, we can guarantee that the data becomes consistent eventually.
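The shape the system evolved into can be sketched like this. It is a simplification under loud assumptions: the queue here is an in-memory deque, whereas the real thing needs durable storage, and DegradingCaller and FlakyBilling are hypothetical names for the illustration.

```python
from collections import deque


class DegradingCaller:
    """Sync RPC on normal days, queue and replay when the downstream fails."""

    def __init__(self, downstream):
        self._downstream = downstream
        self._retry_queue = deque()  # stand-in for a durable async queue

    def call(self, request):
        try:
            return {"result": self._downstream(request), "degraded": False}
        except Exception:
            # Record the failed message, answer with a fake result.
            self._retry_queue.append(request)
            return {"result": None, "degraded": True,
                    "disclaimer": "Result pending, it will be confirmed later."}

    def replay_pending(self):
        """Retry every queued request once, keep the ones that still fail."""
        replayed = 0
        for _ in range(len(self._retry_queue)):
            request = self._retry_queue.popleft()
            try:
                self._downstream(request)
                replayed += 1
            except Exception:
                self._retry_queue.append(request)
        return replayed

    @property
    def pending(self):
        return len(self._retry_queue)


class FlakyBilling:
    """Hypothetical downstream that can be switched off."""

    def __init__(self):
        self.healthy = True
        self.processed = []

    def __call__(self, request):
        if not self.healthy:
            raise ConnectionError("billing is down")
        self.processed.append(request)
        return {"charged": request["amount"]}


billing = FlakyBilling()
caller = DegradingCaller(billing)
ok = caller.call({"order": 1, "amount": 10})
billing.healthy = False
degraded = caller.call({"order": 2, "amount": 20})
billing.healthy = True
replayed = caller.replay_pending()
```

Replaying only works safely if the downstream handles a repeated message without double processing, which the sketch does not address.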

===

Be pragmatic, and carry on

VBA Stacktrace

2012-02-27T14:46:00.004+08:00

Logging in VBA is hard, believe it or not. There is no obvious way to find out what went wrong, even if you decide to do some manual logging. Unlike C# or other "industrial" programming environments, there is no easy way to get the stacktrace of the current execution point. This article describes a way that I invented/found to do it "elegantly" (relatively speaking). It looks like this:

Public Sub GetStockPrice()
If HEG Then On Error GoTo PROC_ERR
  'sub body
PROC_ERR:
  GetLogger().Error "GetStockPrice"
End Sub

Public Sub SetStockCode(StockCode)
If HEG Then On Error GoTo PROC_ERR
    'sub body
PROC_ERR:
    GetLogger().ReraiseError "StockInfoForm.SetStockCode", Array(StockCode)
End Sub

HEG is a global boolean constant that stands for "Handle Error Globally". If we wrap all our subs/functions with blocks like the above, we can ensure that a log is written whenever an error happens. The stacktrace will be available in the log file, along with the invocation parameters of every call in the invocation chain.

You might wonder how it is implemented. It is actually quite simple: it does NOT maintain a stacktrace somewhere. If we did it that way, we would need to push/pop a frame onto/from the stacktrace on every call. My way is a little bit simpler, as it does not force you to maintain a separate stacktrace. When an error is raised, the On Error GoTo statement catches it, and the GetLogger().Error call writes the error to the log file. One thing that is not so obvious is that, instead of Resume Next, the GetLogger().ReraiseError call raises another error, which can be caught at the outer level. There, the reraised error is caught and logged to the file again. This way, a stack trace is recorded in the file, with the root cause at the top and the outermost caller at the bottom. The normal path also falls through into PROC_ERR, which is harmless because the logger returns immediately when Err.Number is 0.
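The log-and-reraise flow is easier to see in a language with decorators. The sketch below is in Python purely for illustration. Python already has real tracebacks, so nobody needs this there. Reraised plays the role of ReraisedErrorNumber, and LOG stands in for the log file; both are hypothetical names.

```python
import functools


class Reraised(Exception):
    """Marker raised after a frame has already been logged."""


LOG = []  # stand-in for the log file


def logged(func):
    @functools.wraps(func)
    def wrapper(*args):
        try:
            return func(*args)
        except Reraised:
            # An inner frame already logged the root cause: add our frame.
            LOG.append("Stack: %s%r" % (func.__name__, args))
            raise
        except Exception as exc:
            # First handler to see the error: log the root cause.
            LOG.append("Stack (Root): %s%r" % (func.__name__, args))
            LOG.append("Description: %s" % exc)
            raise Reraised() from exc
    return wrapper


@logged
def get_stock_price(code):
    return {"AAPL": 1}[code]  # fails with KeyError for unknown codes


@logged
def set_stock_code(code):
    return get_stock_price(code)


try:
    set_stock_code("XYZ")
except Reraised:
    pass
```

After the call, LOG holds the root frame first and the outermost caller last, which is the same order the VBA logger writes to its file.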

The complete source code is available here (Logger.cls):

Option Explicit

Const Level As String = "Info"
Const Output As String = "File"
Const ReraisedErrorNumber As Long = vbObjectError + 1985

Private Context As New Collection
Private FileNumber As Integer

Public Sub ClearContext()
    Set Context = New Collection
End Sub

Public Sub Error(FunctionName As String, Optional Args)
    If IsMissing(Args) Then
        Args = Array()
    End If
    HandleError False, FunctionName, Args
End Sub

Public Sub ReraiseError(FunctionName As String, Optional Args)
    If IsMissing(Args) Then
        Args = Array()
    End If
    HandleError True, FunctionName, Args
End Sub

Public Sub Info(Msg As String)
    If "Error" = Level Then
        Context.Add Msg
    Else
        PrintToOutput "Info", Msg
    End If
    FlushOutput
End Sub

Public Sub Dbg(Msg As String)
    If "Dbg" = Level Then
        PrintToOutput "Debug", Msg
    Else
        Context.Add Msg
    End If
    FlushOutput
End Sub

Private Sub PrintToOutput(Level As String, Msg As String)
    Dim FormattedMsg As String
    FormattedMsg = "[" + Level + "]" + " " + CStr(Now()) + ": " + Msg
    If "File" = Output Then
        Print #GetFileNumber(), FormattedMsg
    Else
        Debug.Print FormattedMsg
    End If
End Sub

Private Sub FlushOutput()
    Close #GetFileNumber()
    FileNumber = 0
End Sub

Private Function GetFileNumber()
    If FileNumber = 0 Then
        FileNumber = FreeFile
        Open GetFilePath() For Append Access Write Shared As FileNumber
    End If
    GetFileNumber = FileNumber
End Function

Private Function GetFilePath()
Dim FileName As String
    FileName = "zebra-word-" + CStr(Year(Now())) + "-" + CStr(Month(Now())) + "-" + CStr(Day(Now())) + ".log"
    GetFilePath = Application.Path + ":Zebra:Log:" + FileName
End Function

Private Sub HandleError(ReraisesError As Boolean, FunctionName As String, Args)
    If 0 = Err.Number Then
        Exit Sub
    End If
    If ReraisedErrorNumber = Err.Number Then
        PrintToOutput "Error", "Stack: " + FormatInvocation(FunctionName, Args)
    Else
        PrintToOutput "Error", "Stack (Root): " + FormatInvocation(FunctionName, Args)
        PrintToOutput "Error", "Err Number: " + CStr(Err.Number)
        PrintToOutput "Error", "Err Source: " + Err.Source
        PrintToOutput "Error", "Description: " + Err.Description
        PrintToOutput "Error", "Help File: " + Err.HelpFile
        PrintToOutput "Error", "Help Context: " + CStr(Err.HelpContext)
        PrintToOutput "Error", "Last Dll Error: " + CStr(Err.LastDllError)
        DumpContext
    End If
    Err.Clear
    FlushOutput
    If ReraisesError Then
        Err.Raise ReraisedErrorNumber
    Else
        MsgBox "Oops... something went wrong. Please send your blame to taowen@gmail.com"
    End If
End Sub

Private Sub DumpContext()
Dim Msg
    PrintToOutput "Context", "Dumping context..."
    For Each Msg In Context
        PrintToOutput "Context", CStr(Msg)
    Next Msg
    PrintToOutput "Context", "Dumped context"
    Set Context = New Collection
    FlushOutput
End Sub

Private Function FormatInvocation(FunctionName, Args)
Dim i As Integer
Dim InvocationDescription As String
    InvocationDescription = FunctionName + "("
    For i = LBound(Args) To UBound(Args)
        If i > LBound(Args) Then
            InvocationDescription = InvocationDescription + ", "
        End If
        InvocationDescription = InvocationDescription + FormatArg(Args(i))
    Next i
    InvocationDescription = InvocationDescription + ")"
    FormatInvocation = InvocationDescription
End Function

Private Function FormatArg(Arg)
Dim ArgType As Integer
    ArgType = VarType(Arg)
    If vbEmpty = ArgType Then
        FormatArg = "[Empty]"
    ElseIf vbNull = ArgType Then
        FormatArg = "[Null]"
    ElseIf vbInteger = ArgType Then
        FormatArg = CStr(Arg)
    ElseIf vbLong = ArgType Then
        FormatArg = "[Long]" + CStr(Arg)
    ElseIf vbSingle = ArgType Then
        FormatArg = "[Single]" + CStr(Arg)
    ElseIf vbDouble = ArgType Then
        FormatArg = "[Double]" + CStr(Arg)
    ElseIf vbCurrency = ArgType Then
        FormatArg = "[Currency]" + CStr(Arg)
    ElseIf vbDate = ArgType Then
        FormatArg = "[Date]" + CStr(Arg)
    ElseIf vbString = ArgType Then
        FormatArg = """" + Arg + """"
    ElseIf vbObject = ArgType Then
        FormatArg = "[Object]"
    ElseIf vbError = ArgType Then
        FormatArg = "[Error]"
    ElseIf vbBoolean = ArgType Then
        FormatArg = CStr(Arg)
    ElseIf vbVariant = ArgType Then
        FormatArg = "[Variant]"
    ElseIf vbDataObject = ArgType Then
        FormatArg = "[DataObject]"
    ElseIf vbDecimal = ArgType Then
        FormatArg = "[Decimal]" + CStr(Arg)
    ElseIf vbByte = ArgType Then
        FormatArg = "[Byte]" + CStr(Arg)
    ElseIf vbUserDefinedType = ArgType Then
        FormatArg = "[UserDefinedType]"
    ' VarType returns vbArray combined with the element type, so test the flag
    ElseIf (ArgType And vbArray) = vbArray Then
        FormatArg = "[Array]"
    Else
        FormatArg = "[Unknown]"
    End If
End Function

Retrospection: the mistakes I have made these years

2011-01-04T23:20:00.004+08:00

Someone told me that it takes more than 10000 hours of repeated practice to make a mature professional. I am still far away from that standard, but after 5 years of programming as my profession, I realized I have already made so many mistakes that they are worth some conscious retrospection.

One presentation I did not watch, but whose slides I really liked: http://www.infoq.com/presentations/LMAX. In the slides, they said:

On a single thread you have ~3 billion instructions per second to play with: to get 10K+ TPS if you don't do anything too stupid.

I have to say, I did many things that seemed smart at first and turned out to be stupid, which left the ~3 billion instructions per second hardware unable to help the project. It is not just about performance; many mistakes lead to other symptoms as well.

Sometimes, the "I" here can be substituted with "We". I am sure of that, as I have seen other people make the same mistakes I did. As poor software developers, we do not have control over many things, except the code in our hands. It is not surprising that people spend a lot of time making their code "smart". A lot of lessons can be learned from that smartness.

I do not have a full list yet, but as a start I will list some here. As this blog is primarily technical, I will keep the items mostly relevant. If I find time I will complete them one by one:

How build tools re-invent scripting languages and the command line, especially msbuild
The evil of lazy loading
Other evil things of ORM
How to hack your dependency injection tools into rocket science
Building a castle on top of sand, aka Outlook and isolation
Anything related to Microsoft is evil, especially COM
Encapsulation might help initially, but it is not as helpful as you expect, and is sometimes even harmful
Re-inventing the wheel in many ways, and how to make up reasons to make it look good
Abandoned architecture is even worse than wrong architecture
...


Package, the missing language feature - Part II

2010-11-27T17:39:00.000+08:00

Problems

In the previous post, we talked about how packages work in the Python language. Essentially, the problem is that the package is a good box, but its color is not black. We want the package to expose all of its API at the package level, and seal up any internal details. from A import * should give you all the things you need; you should not need to import A.B or import A.C.

Unimportable

So, how do we make a module unimportable? There are two things you need to do. First, remove the B attribute from the package A object. By doing this, import A.B will fail, because import A.B first imports A, then imports A.B, and then gets B from A. With B deleted from A, the import fails. Second, you need to remove A.B from sys.modules, by setting sys.modules['A.B'] = None. This makes from A.B import * fail.

import sys
package_A = sys.modules['A']       # the already imported package object
delattr(package_A, 'B')
sys.modules['A.B'] = None

This way, we completely hide the existence of A.B, which is the behavior we want when another package tries to import this private internal. The drawback of this mechanism is that the error message the user gets is not friendly. They will be told the module does not exist, although it actually exists if you look it up in the file browser.
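To see the two steps in action, here is a self-contained sketch that builds a throwaway package on disk and then seals one of its modules. It assumes a current Python 3, where a None entry in sys.modules makes the import raise ImportError. The package name demo_pkg_a is made up for the demonstration.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package "demo_pkg_a" with one internal module "b".
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg_a")
os.mkdir(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("from .b import helper\n")  # the package level API
with open(os.path.join(pkg_dir, "b.py"), "w") as f:
    f.write("def helper():\n    return 'from b'\n")
sys.path.insert(0, root)

package_a = importlib.import_module("demo_pkg_a")

# Step 1: drop the attribute. Step 2: poison the sys.modules entry.
delattr(package_a, "b")
sys.modules["demo_pkg_a.b"] = None

try:
    importlib.import_module("demo_pkg_a.b")
    sealed = False
    error_message = ""
except ImportError as exc:
    sealed = True
    error_message = str(exc)  # not friendly, as noted above

api_result = package_a.helper()  # the exposed API keeps working
```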

When?

By making internal packages and modules unimportable, we can make the parent package a blackbox. But when do we do this, deleting all the internal packages and modules?

The best place is the __init__.py of the parent package. But after we delete the internal packages and modules, they are gone. What if A.B references A.C in its code? What we need to do is make sure A.B is initialized (imported) before sealing up A. Inside A.B, the code might use import A.C or from A.C import xxx; both ways copy the referenced name into the local namespace. So even when A.C can no longer be imported, A.B can still reference it.

import sys

from .B import *                   # eager load A.B (and what it imports) first
package_A = sys.modules['A']
delattr(package_A, 'B')            # then seal it up
sys.modules['A.B'] = None

Where?

Do I need to write that kind of ugly delattr in every __init__.py file? Isn't that a cross-cutting concern that should not be repeated everywhere?

Yes, so let's find some way to magically inject that code into every __init__.py file. The code actually has three parts. Part 1, expose the API. Part 2, eagerly load the sub modules. Part 3, delete the sub modules. The API still needs to be manually defined in __init__.py. But parts 2 and 3 can be put into a "post-import hook".

What is a post import hook? It is code that is executed after a module has been imported. After A has been imported, we can eagerly load all of its sub modules by scanning the folder, and then delete them. Post import hooks are not directly supported in Python, but they can be built on top of the more powerful meta import hook.

def register_meta_import_hook(should_apply, post_import_hooks):
    import sys
    import imp  # legacy module: deprecated since Python 3.4, removed in 3.12

    class Importer(object):
        def __init__(self, file, pathname, description):
            self.file = file
            self.pathname = pathname
            self.description = description

        def load_module(self, name):
            try:
                module = imp.load_module(name, self.file, self.pathname, self.description)
                for post_import_hook in post_import_hooks:
                    post_import_hook(module)
                return module
            finally:
                if self.file:
                    self.file.close()

    class Finder(object):
        def find_module(self, qualified_module_name, path):
            if not should_apply(qualified_module_name):
                return
            if not path:
                path = sys.path
            module_name = qualified_module_name.rpartition('.')[2]
            file, pathname, description = imp.find_module(module_name, path)
            return Importer(file, pathname, description)

    sys.meta_path.append(Finder())
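On a current Python 3 the imp module is no longer available, so the same post import hook has to be built with importlib. The following is a rough sketch of one way to do it, not a drop-in replacement: it wraps the loader that the standard PathFinder would have used and runs the hooks after the module body has executed. PostImportLoader, PostImportFinder and register_post_import_hook are names I chose for the example.

```python
import importlib
import importlib.machinery
import os
import sys
import tempfile


class PostImportLoader:
    """Wraps a real loader and runs the hooks once the module has executed."""

    def __init__(self, wrapped, hooks):
        self._wrapped = wrapped
        self._hooks = hooks

    def create_module(self, spec):
        return self._wrapped.create_module(spec)

    def exec_module(self, module):
        self._wrapped.exec_module(module)
        for hook in self._hooks:
            hook(module)


class PostImportFinder:
    """Meta path finder that only handles names accepted by should_apply."""

    def __init__(self, should_apply, hooks):
        self._should_apply = should_apply
        self._hooks = hooks

    def find_spec(self, fullname, path, target=None):
        if not self._should_apply(fullname):
            return None  # let the regular finders handle it
        spec = importlib.machinery.PathFinder.find_spec(fullname, path, target)
        if spec is None or spec.loader is None:
            return None
        spec.loader = PostImportLoader(spec.loader, self._hooks)
        return spec


def register_post_import_hook(should_apply, hooks):
    # Must come before the default finders, or it is never consulted.
    sys.meta_path.insert(0, PostImportFinder(should_apply, hooks))


# Demonstration with a throwaway module (hypothetical name).
root = tempfile.mkdtemp()
with open(os.path.join(root, "hooked_demo_mod.py"), "w") as f:
    f.write("VALUE = 1\n")
sys.path.insert(0, root)

seen = []
register_post_import_hook(
    should_apply=lambda name: name.startswith("hooked_demo"),
    hooks=[lambda module: seen.append(module.__name__)],
)
mod = importlib.import_module("hooked_demo_mod")
```

The wrapper only forwards create_module and exec_module, so tools that expect the other loader methods on module.__loader__ would need more delegation than this sketch provides.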

Conclusion

By eagerly loading the sub modules and deleting them in a post import hook, we can seal up the package and force people to define the package API in __init__.py, because that is the only way to let outsiders use the internals.

Another interesting side effect is that circular dependencies between packages are no longer possible. In Python, circular dependencies between modules are not possible, but because the modules in a package are lazily loaded, circular dependencies between packages were possible. After eagerly loading the sub modules, we have disabled circular dependencies between packages as well. It is a good thing, but it could be very strict.

Finally, we have the box, and we automatically seal it up in a post import hook. If the box writers want to make the box usable from the outside, they need to define its API in __init__.py. Plus, no circular dependencies, ever.
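The hook body itself, eager load then delete, might look like the sketch below on Python 3. It is called directly here instead of from a hook to keep the example short, it only handles direct children of the package, and seal_package and sealed_demo are hypothetical names.

```python
import importlib
import os
import pkgutil
import sys
import tempfile


def seal_package(package):
    """Eagerly import every direct submodule of `package`, then hide them."""
    names = [info.name for info in pkgutil.iter_modules(package.__path__)]
    for name in names:  # part 2: eager load by scanning the folder
        importlib.import_module(package.__name__ + "." + name)
    for name in names:  # part 3: delete
        if hasattr(package, name):
            delattr(package, name)
        sys.modules[package.__name__ + "." + name] = None


# Throwaway package: b uses c, and only shout() is exposed as the API.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "sealed_demo")
os.mkdir(pkg_dir)
files = {
    "__init__.py": "from .b import shout\n",
    "b.py": "from .c import greet\n\ndef shout():\n    return greet().upper()\n",
    "c.py": "def greet():\n    return 'hello'\n",
}
for filename, source in files.items():
    with open(os.path.join(pkg_dir, filename), "w") as f:
        f.write(source)
sys.path.insert(0, root)

package = importlib.import_module("sealed_demo")
seal_package(package)

try:
    importlib.import_module("sealed_demo.c")
    c_importable = True
except ImportError:
    c_importable = False

result = package.shout()  # b still holds its own reference to greet
```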

Package, the missing language feature - Part I

2010-11-25T23:17:00.008+08:00

Introduction

We have spent way too much time on functions and classes. We put a lot of energy into maintaining clean and concise class interfaces. We care about the dependencies between classes, encouraging dependency injection and wiring up objects via interfaces.

Besides functions and classes, we do have higher level constructs. We have hacked Java class loading to hell, and then Satan gave us back his OSGi. We have invented a dedicated job to maintain the manifests of EJB. Where dependency injection is not enough, a concept called Module is emerging in modern things like Guice and Autofac.

What is a package? By itself, it is merely a name. It is just some annoying leading dots before the thing you actually want. It is not even a thing, it is just a prefix that gets ignored. People might say, oh yes, a package is not doing anything, why should I care? A function does something, a class also does something, and a package is just some dummy folder that I can put that valuable stuff into.

True, very true... and so think the language designers. I can not say all of them do, but at least some of them do. Stroustrup ignored packages. Gosling ignored packages. Even Hejlsberg ignored packages (but the assembly is better than nothing). What a huge mistake!

The problem we normally need to solve when writing business software is not some scientific problem. In my mind, the only problem we need to solve is managing complexity. As we learned a long time ago, the only way to control complexity is to break it down, and break it down further: one thing containing many other things. We need a blackbox to encapsulate the internal complexity monster and give the outside a clean and simple illusion. But when using Java or C#, I constantly find that I need to reinvent all kinds of blackboxes to meet my needs. And none of them seems natural to newcomers, simply because they are not part of the original language, not known to most people, and not supported by many tools. There are many design patterns, and people say they exist because the language itself is flawed. There are also many component platforms/frameworks, and I say they exist because the language itself is flawed. The language does not give us the blackbox, so we need to invent one ourselves.

The package has been a missing language feature for a long time. But luckily, Java and C# are not the only choices we have. In another open wonderland, without money but with happiness, we have our lovely Python. There, we finally see what is called a package.

Package in Python

A package in Python is simple. If you have a folder called some_package, and you have an __init__.py file in that folder, then it becomes a package called some_package. If you happen to have another folder called another_package inside the some_package folder, and it also has an __init__.py file inside, then it becomes some_package.another_package.

The key difference between a package in Java and a package in Python is that in Java, the package is just a literal symbol that does not exist at runtime. In Python, the package is a living object, and you can set and get attributes on it at any time. some_package.another_package.abc = 'def' is a valid statement in the Python language.
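A quick way to check this for yourself, using the standard library package email.mime in place of some_package.another_package:

```python
import types
import email.mime  # a package nested inside another package

# The package is an ordinary module object at runtime.
is_module_object = isinstance(email.mime, types.ModuleType)
has_path = hasattr(email.mime, "__path__")  # what marks it as a package

# Attributes can be set, read and removed at any time.
email.mime.abc = "def"
value = email.mime.abc
del email.mime.abc
removed = not hasattr(email.mime, "abc")
```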

This gives us the box we want. We can use this box to define our interface and hide our internal complexities. In a package structure like A.B.C, A should hide B and C, and B should hide C. At the A level, you might say start the car. At the B level, you might say start the engine, and then start the radio and the air conditioner. The hierarchical structure of package naming is the best fit for natural encapsulation.

The box can also initialize itself. It has an __init__.py file which can be used to execute any bootstrapping code. Sometimes we need to sort out some internal stuff before being ready to serve the outside. Sometimes we need to register ourselves as a subscriber for events published somewhere else. Having a simple __init__ solves a lot of problems. Is it powerful enough? Not really. It does not support a thousand other features, like full lifecycle management, a standard remote control interface, etc. As a user facing public component, the package construct exposed by the language is very limited. But we can build on top of what the language provides.

The real problem of the Python package is not that it does not support things like JMX. The real problem is that the box is not really a black box. Actually, everything in Python is sort of made of glass. You can see right through nearly everything; in Java terms, it is all public. We can use _ as a convention, and __ as a hard compiler constraint in some places, but here the ugly underscore does not help. You can reference A.B.C.xxx anytime you want, and that is dangerous. It breaks encapsulation and introduces tangled dependencies without anyone noticing. No one likes that. We want to make sure Y.X.Z only references A.B.C.xxx through A.yyy. It should not know about internals like A.B.C.xxx.

The classic "pythonic" response would be that this is just a convention, and when a convention is there, people should follow it. The problem is that this is not an easy convention, and it can be broken at any minute. There is no easy rule people can follow. The real difficulty is that when you reference A.B.C.xxx, you can not always reference it as A.yyy. If your code lives inside A.B, then you should not reference A.yyy, because the inner package should not reference the outer one, as it sits lower in the dependency pyramid. In this case, you do need to reference A.B.C.xxx as A.B.C.xxx, as it is something you have to deal with. It is no longer a hidden internal, because you are living inside the internals. In other words, A.B.C.xxx is not always public or always private. Whether it is accessible depends on where you are. And that is exactly what encapsulation is about.

How can we make the box really a black box? Let's continue in Part II.

Update 11.27

Alan (https://alanfranz.pip.verisignlabs.com/) commented: I'm not 100% sure about what you'd like to say in the next part, however:

beware about "bootstrapping code". Many times such "static initializer" is known to provoke unpredictable problems, and will prevent package breakup via pkg_resources namespace_package , if ever needed.
Most of the times initialization should be performed by the client code or should be performed at first request; import-time initialization is absolutely abused in python coding.
convention is fine. If anybody imports a module or package that starts with underscore, it's their business - after all, if they've got the source code, they can modify all the names and make them public, can't they? Would you prefer java-like things where you can set methods and attributes private, and then you can access them via other means through some common.util.lang tool?
nesting too much might just be unneeded, and if you want a leaf (a.b.c) not to depend on its parent you can use relative imports. But remember that two modules importing one the other trigger an error in Python, you simply can't do that.

Thanks Alan! I am surely aware of "bootstrapping code". The major problem of import time initialization is it is implicit. And it will be even worse if the bootstrapping is I/O intensive or causing other side effects. But if there are too many clients, like a lot of unit tests, pushing responsibility to them is also inconvenient. I use __import__('x.y.z') in the main function to implicit stating that I want to use those packages and initialize them now.

Starting a name with _ to mean private is fine. But a package is not always private: it is public to its siblings, but private to other packages. Traditional visibility only allows you to say whether a thing is public or not, and I think that is not enough. How many details you are allowed to know, or should depend on, may be contextual.

Python does check circular dependencies at the module level, but it does not check them at the package level. For example, A.B cannot circularly depend on C.D, but if A.E depends on C.D and C.D depends on A.B, that is allowed. Yet it actually means package A depends on package C, and package C depends on package A as well.
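To make the package-level cycle concrete, here is a minimal sketch in Python. It assumes the module dependencies are already known and handed in as a plain dictionary (a real tool would have to extract them from import statements), and the helper names package_of and package_cycles are made up for this illustration. It only detects two packages that depend on each other directly, not longer cycles.

```python
from collections import defaultdict


def package_of(module):
    """Return the top-level package of a dotted module name, e.g. 'A.B' -> 'A'."""
    return module.split(".")[0]


def package_cycles(module_deps):
    """Return pairs of top-level packages that depend on each other."""
    pkg_deps = defaultdict(set)
    for module, deps in module_deps.items():
        for dep in deps:
            src, dst = package_of(module), package_of(dep)
            if src != dst:
                pkg_deps[src].add(dst)
    return {
        (a, b)
        for a, targets in pkg_deps.items()
        for b in targets
        if a < b and a in pkg_deps.get(b, set())
    }


# The example from the text: no module imports itself back,
# yet packages A and C depend on each other.
module_deps = {
    "A.E": {"C.D"},
    "C.D": {"A.B"},
    "A.B": set(),
}
cycles = package_cycles(module_deps)
```

Here cycles contains the single pair ("A", "C"), even though no two modules form a cycle.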

Data Migration (3)

2010-02-22T17:48:00.003+08:00

The final question about data migration: how do we write it? We already know it is just a function that transforms one dictionary into another. We also know there will be questions around dependencies between entities. So what we need are several functions, each of which upgrades the data by one version. The migration function is per version delta, not per entity. Each function needs to do several things:

Find the entities that need to be migrated.
Load the entity state as a dictionary.
Apply the migration logic to the dictionary.
Save the entity state back.
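The four steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the store is a plain in-memory dictionary standing in for the EntityState table, and the type name Domain.Person and the rename of name to firstName are hypothetical.

```python
def migrate_v1_to_v2(store):
    """One version delta: rename 'name' to 'firstName' on every Person entity."""
    # 1. Find the entities that need to be migrated (here: by CLR_TYPE).
    target_ids = [
        entity_id
        for entity_id, state in store.items()
        if state.get("CLR_TYPE") == "Domain.Person"
    ]
    for entity_id in target_ids:
        # 2. Load the entity state as a dictionary (a copy of the stored one).
        state = dict(store[entity_id])
        # 3. Apply the migration logic to the dictionary.
        if "name" in state:
            state["firstName"] = state.pop("name")
        # 4. Save the entity state back.
        store[entity_id] = state


# One function per version delta, applied in order.
MIGRATIONS = [migrate_v1_to_v2]

store = {
    "1": {"CLR_TYPE": "Domain.Person", "name": "Kayla"},
    "2": {"CLR_TYPE": "Domain.Calendar.Location", "Country": "China"},
}
for migrate in MIGRATIONS:
    migrate(store)
```

After the loop, entity "1" carries firstName instead of name, and entity "2" is untouched.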

Finding the entities to be migrated is easy. We already know SQLServer has XQuery support, so we can write a customized XQuery to find the target entities. Most of the time, it will be based on CLR_TYPE. The only remaining problem is how to write a function that transforms one in-memory dictionary into another.

It might seem easy: we just need to write a function in C# that takes a dictionary as input and returns a new dictionary. Yes, this would work. But the code would be very detailed and would not reveal its intent. It would need a lot of casting to turn an entry into a list, a string, or another dictionary, based on your knowledge of the object graph. It would also need a lot of low-level operations, such as copying a field to another and deleting the old field in order to do a rename. The plumbing code might be reduced by introducing a dynamically typed language, like Ruby, as the migration scripting language. But the more essential problem is how to raise the abstraction level, so that the migration script looks more intentional and reveals the original requirements.

One naive change is to write several functions for the well-known refactorings, such as Rename, MoveType, and ExtractEntities. That was exactly what we tried before. The problem with these small functions is that they are not really that reusable. Say we have a rename function that changes a direct field from one name to another. What if the renamed field is not directly on the root, but somewhere inside the object graph? Then the rename function can no longer help us. We might think we can abstract the "locating" part of the function: instead of passing in two strings to identify the fields by name, we pass in two locators.

The locator is not easy to implement. Say we are renaming field x.y.z to x.y.k: x is a branch of the root, y is a branch of x, z was the field on y, and k is the new name of z. The rename function needs to take "x.y.z" and "x.y.k" as input and know how to apply them. For "x.y.z" we need to "get" the value, then use "x.y.k" to "set" the value, then use "x.y.z" again to delete the field. The logic of getting a value is very different from the logic of setting a value.
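A minimal Python sketch of such a locator-based rename, assuming the entity state is made of nested dictionaries only (lists are left out) and that a dotted string is good enough as a locator. The helper names get_path, set_path, delete_path and rename are invented for this example.

```python
def get_path(state, path):
    """Follow a dotted path such as 'x.y.z' and return the value found there."""
    node = state
    for key in path.split("."):
        node = node[key]
    return node


def set_path(state, path, value):
    """Set the value at a dotted path, creating intermediate dictionaries as needed."""
    *parents, leaf = path.split(".")
    node = state
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def delete_path(state, path):
    """Remove the field at a dotted path."""
    *parents, leaf = path.split(".")
    node = state
    for key in parents:
        node = node[key]
    del node[leaf]


def rename(state, old_path, new_path):
    """Get the old value, set it under the new path, then delete the old field."""
    value = get_path(state, old_path)
    set_path(state, new_path, value)
    delete_path(state, old_path)


state = {"x": {"y": {"z": 42}}}
rename(state, "x.y.z", "x.y.k")
```

The three helpers show why getting and setting differ: getting must fail when a branch is missing, while setting may have to create the branch first.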

In general, this approach is called functional programming. By decomposing a big function into smaller ones and then composing them back to cope with different situations, we can maximize reusability.

Flattening & Rebuilding

2010-02-21T11:33:00.003+08:00

Using SQLServer as your nosql database to persist object state raises issues with data migration. I mentioned three in the previous post. There are two more, and today we are going to talk about one of them: you cannot deserialize object state back into a class whose fields have changed. Think about a class that used to have a field called name, which has now been changed to firstName. When deserializing, where should the value of name be assigned? If we cannot get at the raw value, how can we apply the data migration logic? We already mentioned that the data we store in SQLServer is XML. Are we going to parse the XML and manipulate the XML elements directly? Yes, I think we have to.

So the data migration logic is not applied to the same object model that your application logic deals with. It has to work at a lower level. We could migrate the XML elements directly, but that is too tightly coupled to the data format. Before we used XML, we actually tried JSON for a month, until we found that XQuery is really a killer feature. Also, an XML element carries many things we do not care about. So what we need is a model that can capture the state of the objects but is simple enough. This model is also directly related to how serialization and deserialization work. It works like this:

object ==Flatten==> many dictionaries(with dict/list/string inside) ==Serialize==> XML

XML ==Deserialize==> dictionary(load referenced entity state on demand) ==Rebuild==> objects

The XML looks like this:

<Entity CLR_TYPE="Domain.Calendar.Location" Country="China" 
CountryAbbreviation="CN" LId="43-123" TaxUnit="Jiangxi" TaxUnitAbbreviation="JX" />

It will be deserialized to a dictionary with one entry per attribute: CLR_TYPE plus five field values. The CLR_TYPE is used in the rebuilding process to turn the dictionary back into an object. Besides dictionaries, strings are also valid: a string is used to store a field name as well as a simple field value. The persistence layer needs to define how to translate a date time into a string, and so on. Collections are also valid, although in theory a collection is just a special case of a dictionary.
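As a rough illustration of the deserialize step, the sketch below uses Python's standard xml.etree.ElementTree to turn the XML above into a dictionary. It only covers flat attributes; how nested dictionaries and collections are laid out in the XML is not shown here, so that part is left out. The function name deserialize is an assumption.

```python
import xml.etree.ElementTree as ET

xml_text = (
    '<Entity CLR_TYPE="Domain.Calendar.Location" Country="China" '
    'CountryAbbreviation="CN" LId="43-123" TaxUnit="Jiangxi" TaxUnitAbbreviation="JX" />'
)


def deserialize(text):
    """Turn one entity's XML into a flat dictionary of attribute name -> string value."""
    return dict(ET.fromstring(text).attrib)


state = deserialize(xml_text)
# CLR_TYPE drives the rebuild step; the remaining entries are the field values.
clr_type = state["CLR_TYPE"]
fields = {key: value for key, value in state.items() if key != "CLR_TYPE"}
```

A migration function would operate on state directly, without ever needing the Location class.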

The XML is just the state storage for a single entity. Entities can relate to each other, but we are not going to store the state of another entity in the same XML. Other entities are referenced by ID and stored separately, in different rows of the EntityState table.

Because we separated serialization into two phases, we can do data migration. A data migration is just a function that takes a dictionary as input and produces another dictionary. Now the only problem is: how can we write such a function? It might be trivial for just one version, but if you change the model very frequently, as agile software development does, then it is a big issue. We are going to talk about the "reusability" of data migration rules in the next post.

Data Migration (2)

2010-02-20T14:14:00.002+08:00

There are three difficulties with nosql data migration, particularly if the data is serialized object state:

Relationships between entities cause dependencies between data migration rules.
Lack of ad-hoc query support.
Hard to migrate in batch.

We talked about two approaches: migrate on load and migrate in one time. Both have pros and cons, and they are very much related to the three difficulties mentioned above.

The pros of migrate on load: it does not need to shut down your database or application, so in theory live migration is possible this way. Another related big benefit is that you spread the cost of data migration over time. If the data set is huge, this is very economical, especially when a lot of the data is not frequently used.

The cons of migrate on load: it is very difficult to deal with the dependencies between data migration rules. It cannot fail fast if there are flaws in the data migration code itself. The design is also more sophisticated, so it is more likely to run into problems.

The pros and cons of migrate in one time are exactly the opposite of the above. Both approaches struggle with the lack of ad-hoc query support. For example, if you want to change a reference from an id to a business id, then you very likely need to translate each particular id into the corresponding business id. This kind of query is very unlikely to have a purpose-built index table. So if we do not have ad-hoc query support, the data migration code is very hard to write, and you might need to build a special index table just for data migration. Luckily, if we use SQLServer as the nosql database, we can leverage its XQuery capability.

As for batching, if you migrate on load, it is not a problem. But if you migrate in one time, it might be very time-consuming. This is now the No. 1 concern in my team around data migration, and I have no good solution to it yet. Previously we wrote data migrations in SQL, which is batch processing by nature. But now we do not have a schema, so SQL is no longer applicable, which means more RPC round trips are involved in the data migration. We literally need to load the whole database out. The long-term mitigation is to introduce MapReduce.

The less well-known problem is the problem of dependencies. One simple question: suppose we move a field from class A to class B. When we load an object of class A, should we migrate class A and the referenced object of class B? When we load an object of class B, should we migrate class B and the referenced object of class A? Are we then running into a circular reference problem? This is just an obvious example; there are many less obvious ones. For example, suppose we delete a class. Then every data migration referencing that class must be executed against the whole database, otherwise we may no longer be able to load objects of that class. How can we avoid that? Let's talk about it in the next post.

Data Migration

2010-02-19T21:23:00.003+08:00

For nosql data storage, there is one thing that is not talked about very much: migration. The typical answer is that a nosql database has no schema, so there is no need to migrate the data. But this is a lie. Although the database does not need a schema to store data, that does not mean the data itself has no structure.

Nosql claims to be very flexible, so that semi-structured data can be stored. But if we use SQLServer as a nosql database to persist objects, then the data itself must be structured. If we change the class definition, the data cannot be deserialized unless we migrate it to match the new class definition.

So the first typical response, "no schema, no migration", does not work. The second typical response is backward compatibility. The idea is to let the application code handle data of older versions. This might work if the data structure (especially the inter-references between objects) is simple and the format the application code consumes is not strict (deserialization according to a class definition is extremely strict). Also, the more old versions need to stay compatible, the more complex the application code becomes. So this approach is neither widely applicable nor scalable.

The third response is to add a hook to the persistence layer. Before loading data for use, check its version. If the version is older, execute a special transformation to upgrade the data to the newer version, so that the application code stays orthogonal to the data migration logic. It seems optimal, but it is actually not that easy to implement in practice.
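A minimal Python sketch of such a load hook, under several assumptions: each stored state carries a VERSION entry (hypothetical, it is not part of the format described in these posts), the store is an in-memory dictionary, and the two upgrade steps are invented examples.

```python
CURRENT_VERSION = 3


def v1_to_v2(state):
    # Hypothetical change: rename 'name' to 'firstName'.
    state["firstName"] = state.pop("name")


def v2_to_v3(state):
    # Hypothetical change: add a new field with a default value.
    state.setdefault("lastName", "")


# Maps a stored version to the function that upgrades it by one version.
UPGRADES = {1: v1_to_v2, 2: v2_to_v3}


def load(store, entity_id):
    """Load an entity state, upgrading it step by step if it is stale."""
    state = dict(store[entity_id])
    version = state.get("VERSION", 1)
    while version < CURRENT_VERSION:
        UPGRADES[version](state)
        version += 1
    state["VERSION"] = version
    store[entity_id] = state  # save back so the upgrade runs only once
    return state


store = {"1": {"VERSION": 1, "name": "Kayla"}}
person = load(store, "1")
```

The sketch ignores the hard part discussed in the next post: an upgrade step that needs to touch other, referenced entities.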

The fourth response is to do a full data migration, just like a traditional RDBMS data migration: load all the data out, transform it, and save it back.

For object persistence, only approaches 3 and 4 would work. We tried them both and had a lot of surprising experiences. In the next post, we will look into them in depth and see what works, what works well, and what looks like a good solution but actually costs a lot.

Avoiding N+1 Problem (2)

2010-02-18T17:00:00.002+08:00

The previous post talked about how to use batch loading to avoid loading references one by one, which causes too many queries. We started by optimizing the loading of collections. Then we found that all references can be optimized in the same way. But indirect references seem very tedious to optimize.

How can we gather loads into a batch without making the code look like a mess? For example, object A references objects B and C, object B references D and E, and object C references F and G. How can we load D, E, F, G in one batch? This was not an issue with a traditional ORM: the schemas of B and C are different, so the SQL is different, and there is no way to do this kind of batch loading. But now that all entities are stored with the same schema in the EntityState table, this optimization is logically possible.

The difficulty is not in loading or batching the entities, since the loading is just the same SQL. The difference between loading D, E, F, and G is the post-processing: different objects need to be loaded and then assigned to different fields. So it is essential to know what the post-processing steps are. An ideal way to express this in C# is:

IDictionary<Guid, Action<EntityState>> accumulatedCallbacks

If we can store the post-processing steps in a dictionary called accumulatedCallbacks, then we can decide when to run them. So, instead of doing

var entityState = entityStateLoader.Load("xxxx");
DoMyPostProcessing(entityState);

we pass the post-processing as an Action and store it in the dictionary. Then, when the batch is "big enough", we can call those callbacks, passing in the loaded entity states.

entityStateLoader.LoadLater("xxxx", DoMyPostProcessing);

Now, this seems to work, except for one question: when are we going to call the accumulated callbacks, and when are we going to load the entities with those ids? To answer this, we had better look at the code:

public void Flush()
{
  while (accumulatedCallbacks.Count > 0)
  {
    var callbacks = new Dictionary<Guid, Action<EntityState>>(accumulatedCallbacks);
    accumulatedCallbacks.Clear();
    ApplyCallbackOnEntities(callbacks.Keys.ToArray(), callbacks);
  }
}

private void ApplyCallbackOnEntities(IEnumerable<Guid> ids, Dictionary<Guid, Action<EntityState>> callbacks)
{
  var loadedStates = states.BatchLoad(ids);
  foreach (var state in loadedStates)
  {
    callbacks[state.Id](state);
    callbacks.Remove(state.Id);
  }
  foreach (var callback in callbacks.Values)
  {
    callback(null);
  }
}

The important stuff is in the while loop. The sequence is:

Copy accumulatedCallbacks to a local variable.
Clear accumulatedCallbacks.
Loading happens: states.BatchLoad(ids).
Each callback is called.
A tricky thing is that while the callbacks are being called, accumulatedCallbacks accumulates more callbacks, because each callback calls LoadLater to load its own references as well.
If accumulatedCallbacks is not empty, repeat the steps.

Essentially, we turned a sequential process into asynchronously connected steps, a style also known as continuation. This way we achieve better runtime performance while the main logic (load and assign back to fields) stays unaware of the performance optimization we have done. It is another example of separation of concerns.
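To see the batching effect on the A to G example from the beginning of this post, here is a small Python re-sketch of the same loop. It is not the C# implementation above: the class BatchingLoader, the in-memory table, and the recorded batches list are assumptions made so that the number of simulated queries can be observed. Like the dictionary in the C# code, it keeps only one callback per id.

```python
class BatchingLoader:
    def __init__(self, table):
        self.table = table    # stands in for the EntityState table: id -> state
        self.pending = {}     # id -> callback, like accumulatedCallbacks
        self.batches = []     # ids requested by each simulated query

    def load_later(self, entity_id, callback):
        self.pending[entity_id] = callback

    def batch_load(self, ids):
        self.batches.append(sorted(ids))
        return [self.table[i] for i in ids if i in self.table]

    def flush(self):
        while self.pending:
            # Copy and clear, so callbacks can register new loads meanwhile.
            callbacks, self.pending = self.pending, {}
            for state in self.batch_load(list(callbacks)):
                callbacks.pop(state["id"])(state)
            for callback in callbacks.values():
                callback(None)  # the id was not found


table = {
    "A": {"id": "A", "refs": ["B", "C"]},
    "B": {"id": "B", "refs": ["D", "E"]},
    "C": {"id": "C", "refs": ["F", "G"]},
    "D": {"id": "D", "refs": []},
    "E": {"id": "E", "refs": []},
    "F": {"id": "F", "refs": []},
    "G": {"id": "G", "refs": []},
}
loaded = []


def on_loaded(state):
    if state is None:
        return
    loaded.append(state["id"])
    for ref in state["refs"]:
        loader.load_later(ref, on_loaded)


loader = BatchingLoader(table)
loader.load_later("A", on_loaded)
loader.flush()
```

Seven entities are loaded with three simulated queries: one for A, one for B and C, and one for D, E, F, G.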

Avoiding N+1 Problem

2010-02-17T11:45:00.003+08:00

In the previous post, I talked about how to use SQLServer to store objects in a nosql way. But that leaves an open question: how can we avoid the N+1 problem? What is the N+1 problem? Let's do a quick recap.

The N+1 problem is also called the 1+N problem, or ripple loading. After loading "1" object, we also need to load the "N" objects it references, one by one, with "N" more queries. That is why it is called 1+N. Why is it a problem? The problem is the overhead of the network and of SQL execution.

Database <---Overhead--- Application
Database <---Overhead--- Application
Database <---Overhead--- Application
Database <---Overhead--- Application

The more SQL statements we issue, the more overhead there is. So a natural solution is to batch the operations. If we need just "1" SQL statement to load all "N" references, then the problem is no longer a problem.

A quick solution is to load all the objects inside a collection with one SQL statement. For example, an object "User" has a field "messages" holding a collection of "Message". Assume the user kayla has messages referencing the messages with ids 2, 3, 4, 5. Then we need just one SQL statement to load all her messages.

SELECT * FROM EntityState WHERE id IN (2, 3, 4, 5)

However, this solution only works for loading a collection reference. If the "User" class also references "Department", "Manager", "Calendar" and so on, we still need a separate SQL statement for each of those references, because they are not in a collection.

Moving further, we can iterate over all the fields of an object and collect all the references. For fields with a collection value, we collect all the members. Then, combining them, we can again load everything with one SQL statement.

For example, kayla has a Department with id 6, a Manager with id 7, and a Calendar with id 8, plus the messages referencing 2, 3, 4, 5. Then we just need

SELECT * FROM EntityState WHERE id IN (2, 3, 4, 5, 6, 7, 8)

Then, from the result set returned, we assign 2, 3, 4, 5 back to the field messages, 6 back to the field department, 7 back to the field manager, and 8 back to the field calendar.
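The collect, load, and assign-back steps can be sketched with Python's built-in sqlite3 module standing in for SQLServer. The table layout follows the EntityState table from the previous post; the placeholder XML content and the function name load_references are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EntityState (id INTEGER PRIMARY KEY, xml TEXT)")
conn.executemany(
    "INSERT INTO EntityState (id, xml) VALUES (?, ?)",
    [(i, f"<Entity LId='{i}' />") for i in range(2, 9)],
)

# References held by the user kayla, as in the example above.
references = {"messages": [2, 3, 4, 5], "department": 6, "manager": 7, "calendar": 8}


def load_references(connection, refs):
    # Collect every referenced id, whether it sits in a collection or a single field.
    ids = []
    for value in refs.values():
        ids.extend(value if isinstance(value, list) else [value])
    placeholders = ", ".join("?" for _ in ids)
    rows = dict(connection.execute(
        f"SELECT id, xml FROM EntityState WHERE id IN ({placeholders})", ids
    ))
    # Assign each loaded state back to the field that referenced it.
    return {
        field: [rows[i] for i in value] if isinstance(value, list) else rows[value]
        for field, value in refs.items()
    }


kayla = load_references(conn, references)
```

One SELECT serves all four fields; only the assignment step knows which id belongs to which field.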

Does this solve all of the problems? Not yet. What if the manager also references several other objects, 9 and 10, and the department references several other objects, 11 and 12? Should we load 9, 10, 11, 12 in one SQL statement? How can we do that?

Use SQLServer as your nosql database

2010-02-16T22:11:00.009+08:00

You might be wondering when you will be able to use one of those cool nosql databases in your project. But why, your manager might ask. You had better be prepared. I see nosql databases providing two benefits:

Scalability By removing the schema, a data entry can be replicated very easily. By removing foreign key constraints, a data entry can be replicated without replicating all the entries it references. Then we can build sharding around user boundaries.
Productivity The object in memory and the persisted object state in the database are really the same thing. If we need to do significant mapping between these two models, something must be wrong, and productivity may suffer as well as performance. If the two are really the same thing, why do we have to store the object state as relational data? With a nosql database, persisting objects can be as easy as serialization.

If the second benefit is the one more likely to attract you and your manager, then SQLServer might be usable as the nosql database for your project. SQLServer? Yes, actually any RDBMS could be used as a nosql database. The problem we are trying to solve is how to persist objects to an RDBMS while optimizing for developer productivity first, with runtime performance and scalability as optional goals.

Difficulty of RDBMS object persistence

OR-Mapping constrains the object design, and it is an overhead just like memory management in C++ programming. It becomes very annoying when you refactor often.
Query design. A complex query requires a highly skilled SQL writer. Even if you get it running correctly, you might have a problem making it performant. An ORM solution adds another level of complexity by requiring you to specify the loading strategy.
N+1 problem. Loading a deeply nested object graph very often leads to the N+1 problem. So-called ripple loading is probably the most commonly seen performance problem when using an ORM.

By using the idea of a nosql database, we can overcome those things and let developers focus on using object technology to implement the business requirements. The first thing we need to do to claim being nosql is not "not using SQL" but "not using the schema".

No schema

We create a table called "EntityState" with two columns: id and xml. The id is the id of the persisted entity, and the xml is the content of the entity state. So the "EntityState" table is essentially a key/value database, but actually with more features.

But how can I get the object into XML? There are thousands of ways, I have to say. The most important thing is to classify your objects into two categories: entity or aggregated. Being an entity means the object has an id, and every reference to it should be by id instead of serializing its content into the XML. Being aggregated means the object's state becomes part of the XML of its owner. Once we have done this, the circular reference problem we might otherwise encounter is also prevented.

N+1 Problem

Loading objects is even more likely to run into the N+1 problem if we do not take it into consideration. Say we load the object with id 1, and it references the objects with ids 2 and 3, and the object with id 2 references the objects with ids 4, 5, 6, and so on. Then the SQL statements issued for loading a single object can number as many as one thousand. This is obviously a problem. The rough idea of the solution is to use callbacks or a continuation-like structure. The detailed solution will be described in a later article.

Index Tables

The complex query you used to write is actually doing two things. It queries, of course. It also builds the model to query on the fly. If the model in which you persist the object happens to have the column you query, the query can be as easy as one line. If the model is very far from what you want to query, then you might need to join several tables and do some SUM calculation in the SQL. If we create index tables according to the queries we expect, the problem becomes trivial. The only remaining problem is ensuring the index table gets updated whenever the "EntityState" table is updated.
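One possible way to keep an index table in step with "EntityState" is to write both in the same transaction. The sketch below uses Python's sqlite3 module in place of SQLServer; the index table LocationByCountry and the function save_location are hypothetical names, and a real system might prefer triggers or a generic hook in the persistence layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EntityState (id INTEGER PRIMARY KEY, xml TEXT)")
# Hypothetical index table answering "which locations are in country X?"
conn.execute(
    "CREATE TABLE LocationByCountry (country TEXT, entity_id INTEGER PRIMARY KEY)"
)


def save_location(connection, entity_id, country):
    xml = f'<Entity CLR_TYPE="Domain.Calendar.Location" Country="{country}" />'
    with connection:  # one transaction: both tables change together or not at all
        connection.execute(
            "INSERT OR REPLACE INTO EntityState (id, xml) VALUES (?, ?)",
            (entity_id, xml),
        )
        connection.execute(
            "INSERT OR REPLACE INTO LocationByCountry (country, entity_id) VALUES (?, ?)",
            (country, entity_id),
        )


save_location(conn, 1, "China")
save_location(conn, 2, "Italy")
save_location(conn, 2, "China")  # an update also refreshes the index row

ids_in_china = [
    row[0]
    for row in conn.execute(
        "SELECT entity_id FROM LocationByCountry WHERE country = ? ORDER BY entity_id",
        ("China",),
    )
]
```

The query itself becomes a one-line lookup on the index table, with no need to parse XML at query time.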

By doing this, your SQLServer database is no longer the RDBMS you and your DBA are familiar with. It might sound scary, but it might be worth trying if you have started to think NHibernate/Hibernate might not be the best solution. I will write more articles on this addressing:

Multi bind in Guice 2.0

2009-06-16T14:57:00.004+08:00

Guice is a great tool for dependency injection. But when you need to bind more than one implementation, bind more than one instance, or bind a collection, things become tricky. After fighting with Guice for a long time, I think it is worthwhile to document the tricks I've found.

Bind multiple instances

Say we have two databases to connect to in my project. This code will not work:

bind(SqlMapClient.class).toInstance(createSqlMapClientForDB1());
bind(SqlMapClient.class).toInstance(createSqlMapClientForDB2());

It will not work, not only because Guice does not allow you to bind the same key twice (SqlMapClient.class is the key in this case), but also because of what happens when we use the dependency.

public class Service1 {
  @Inject
  SqlMapClient sqlMapClient;
}

How can we know which database connection we've got? This is a well-known problem, and it has been addressed since 1.0. In 1.0, we have three choices to make it work:

Choice 1: Different Type

We can use subtyping to give the two SqlMapClient instances different types.

public class Db1SqlMapClient implements SqlMapClient {
  private final SqlMapClient delegate;
  // delegate all methods of sql map client
}
public class Service1 {
  @Inject
  Db1SqlMapClient sqlMapClient;
}

Choice 2: Binding Annotation

A key in Guice is not necessarily the type itself; it can be a type plus a binding annotation. Using binding annotations, we can bind the same type multiple times, because each binding still uses a different key (a different binding annotation).

@BindingAnnotation
@Retention(RetentionPolicy.RUNTIME)
public @interface DB1 {
}

bind(SqlMapClient.class)
.annotatedWith(DB1.class)
.toInstance(createSqlMapClientForDB1());

public class Service1 {
  @Inject @DB1
  SqlMapClient sqlMapClient;
}

Choice 3: Named Binding

Guice predefines a binding annotation called "Named". We can use "Names.named()" to create an instance of it.

bind(SqlMapClient.class)
.annotatedWith(Names.named("DB1"))
.toInstance(createSqlMapClientForDB1());

public class Service1 {
  @Inject @Named("DB1")
  SqlMapClient sqlMapClient;
}

Although we have three choices, none of them is perfect. The first one is very tedious, and the user of the SqlMapClient needs to know the concrete type. The second one is better, but the user still needs to say which one it depends on by annotating its dependency, which somewhat violates the principle of "Inversion of Control". We also need to define one more type for the binding annotation. The third choice does not need a new type, but it is not refactoring friendly, and its usages cannot be found by finding references. So the recommended way is the strongly typed binding annotation.

Choice 4: Guice 2.0 Child Injector

In Guice 2.0, we can use child injectors to define different bindings for the same type.

db1Injector = injector.createChildInjector(new AbstractModule() {
  public void configure() {
    bind(SqlMapClient.class).toInstance(createSqlMapClientForDB1());
  }
});
db2Injector = injector.createChildInjector(new AbstractModule() {
  public void configure() {
    bind(SqlMapClient.class).toInstance(createSqlMapClientForDB2());
  }
});

Different database connections need to be injected by different injectors. To use this style, your system has to be partitioned so that it is managed by several different containers. That is not practical in the real world.

Bind Set

It seems easy, doesn't it?

bind(Set.class).toInstance(new HashSet());

But what if we need to bind two sets, one a set of integers and the other a set of strings? How do we do that?

bind(new TypeLiteral<Set<String>>(){}).toInstance(new HashSet<String>(){{
  add("Hello");
  add("World");
}});

Or we can use the Types utility class introduced in Guice 2.0.

bind(Types.setOf(String.class)).toInstance(new HashSet<String>(){{
  add("Hello");
  add("World");
}});

This also seems easy. But what if the elements of the set are not simple strings? What if we have an interface called OrderProcessor:

public interface OrderProcessor {
  void processOrder(Order order);
}

Then we can have different OrderProcessor implementations that process the order differently (send an email, save the order to the database):

public class MailOrderProcessor implements OrderProcessor {
  @Inject
  EmailSender emailSender;
  // send mail
}

public class DbOrderProcessor implements OrderProcessor {
  @Inject
  SqlMapClient sqlMapClient;
  // save order to database
}

OK, now how do we bind a set of order processors? Can we do this?

bind(Types.of(Set.class)).toInstance(new HashSet<OrderProcessor>(){{
  add(new MailOrderProcessor());
  add(new DbOrderProcessor());
}});

No, you can't, because each of them has its own dependencies, and a manually created instance will not have those dependencies injected. To make it work, we have four choices:

Choice 1: Manually call injectMembers

@Inject
Injector injector;
for (OrderProcessor orderProcessor : orderProcessors) {
  injector.injectMembers(orderProcessor);
}

Choice 2: Wrapping the set

public class OrderProcessors {
  private final Set<OrderProcessor> processors = new HashSet<OrderProcessor>();
  @Inject
  public void setDbOrderProcessor(DbOrderProcessor processor) {
    processors.add(processor);
  }
  @Inject
  public void setMailOrderProcessor(MailOrderProcessor processor) {
    processors.add(processor);
  }
}

Choice 3: using getProvider

bind(new TypeLiteral<Set<Provider<OrderProcessor>>>(){}).toInstance(new HashSet<Provider<OrderProcessor>>(){{
  add(getProvider(DbOrderProcessor.class));
  add(getProvider(MailOrderProcessor.class));
}});

Here we used a feature of AbstractModule called getProvider. Although we cannot call injector.getInstance() inside a module, we can get the provider of the instance. This way, what we get is a set of providers of processors instead of a set of order processors. This might not be what you want.

Choice 4: getProvider + ProvidedOrderProcessor

public class ProvidedOrderProcessor implements OrderProcessor {
  private final Provider<? extends OrderProcessor> provider;
  public ProvidedOrderProcessor(Provider<? extends OrderProcessor> provider) {
    this.provider = provider;
  }
  public void processOrder(Order order) {
    provider.get().processOrder(order);
  }
}

Now we can get an order processor instead of a provider of it.

bind(Types.setOf(OrderProcessor.class)).toInstance(new HashSet<OrderProcessor>(){{
  add(getLazyInstance(DbOrderProcessor.class));
  add(getLazyInstance(MailOrderProcessor.class));
}});
OrderProcessor getLazyInstance(Class<? extends OrderProcessor> clazz) {
  return new ProvidedOrderProcessor(getProvider(clazz));
}

Not as easy as we thought, right?

Bind one collection by multiple modules

What if we want to bind one instance of a set from multiple modules? There is an extension to Guice that allows us to do that.

public class Module1 extends AbstractModule {
  public void configure() {
    Multibinder<OrderProcessor> multibinder
         = Multibinder.newSetBinder(binder(), OrderProcessor.class);
    multibinder.addBinding().to(MailOrderProcessor.class);
  }
}
public class Module2 extends AbstractModule {
  public void configure() {
    Multibinder<OrderProcessor> multibinder
         = Multibinder.newSetBinder(binder(), OrderProcessor.class);
    multibinder.addBinding().to(DbOrderProcessor.class);
  }
}

Seems perfect? By the way, the multibindings extension also supports Map.

Limitation

But what about lists? There is no official support for binding a list from multiple modules. Also, how do we bind a chain of responsibility (a.k.a. decorators)?

public class DecoratedOrderProcessor implements OrderProcessor {
  private final OrderProcessor decorated;
  public DecoratedOrderProcessor(OrderProcessor decorated) {
    this.decorated = decorated;
  }
  public void processOrder(Order order) {
    try {
      decorated.processOrder(order);
    } finally {
      // do something;
    }
  }
}

When we have multiple decorators forming a chain of responsibility, the scenario becomes complex. If there is only one module, we can use a technique similar to "ProvidedOrderProcessor" to bind it. But if more than one module needs to bind an element of the chain, there is no official way to do it.

Use Guice to build extension point

Comparing Guice and Spring, one advantage I see is that Guice promotes modular design. By grouping functionality into modules, we can plug and unplug implementations based on the environment and requirements (for example, test and production). To be fair, this is also possible in Spring, but it is easier and more commonly done in the Guice world. Using Guice, we can define something as the default, then allow another module to be plugged in and override it. Here is a list of techniques you can use to achieve this kind of effect:

Choice 1: @ImplementedBy, @ProvidedBy

@ImplementedBy(MailOrderProcessor.class)
public interface OrderProcessor {
}

Then, if no module specifies a binding for OrderProcessor, the default one (MailOrderProcessor in this case) will be used. If there is a binding bind(OrderProcessor.class).to(DbOrderProcessor.class), then that one will be used. This feature is really neat, mostly when we need to change something in the unit test environment.

@ImplementedBy(CurrentTimeProvider.DefaultImpl.class)
public interface CurrentTimeProvider {
  DateTime getNow();
  public static class DefaultImpl implements CurrentTimeProvider {
    public DateTime getNow() {
      return new DateTime();
    }
  }
}

In the production environment, CurrentTimeProvider will automatically use the default implementation. But in a test, we can bind(CurrentTimeProvider.class).toInstance(new FixedTimeProvider(2008,5,12)); and then write the test more easily by fixing the time.

Choice 2: @Inject(optional = true)

public class ProcessOrderService {
  @Inject(optional = true)
  OrderProcessor processor = new DummyOrderProcessor();
}

When the provider side cannot pick a default implementation but the user side does know its default choice, we can annotate the dependency as optional and give it a default value. When there is no binding for OrderProcessor, the feature is disabled by using DummyOrderProcessor. This behavior can be changed by plugging in a new module that provides an implementation of OrderProcessor.

Choice 3: Multibindings

The Guice extension we mentioned above allows us to bind a set or map from multiple modules. Using this extension, new modules can plug in their own implementations to modify the system's behavior. It is a very useful way to provide extension points.

Choice 4: Module Override

This is a new feature of Guice 2.0. Easy to use, and "powerful".

Module finalModule = Modules
.override(new DefaultModule())
.with(new CustomizationModule());

If CustomizationModule defines the same key as DefaultModule, the one defined in DefaultModule will be overridden. It is useful in some cases, but I don't think it is a good feature. Instead, if possible, we should split a big module into smaller modules and compose them depending on our needs, rather than overriding them from outside. But Modules.override opens a way to let multiple modules bind to the same list, or even build a decorator chain:

Key<OrderProcessor> CUSTOMIZABLE_KEY = Key.get(OrderProcessor.class, new Before(MailOrderProcessor.class));
bind(Types.listOf(OrderProcessor.class)).toInstance(new ArrayList<OrderProcessor>(){{
  add(new ProvidedOrderProcessor(getProvider(CUSTOMIZABLE_KEY)));
  add(getLazyInstance(MailOrderProcessor.class));
}});
bind(CUSTOMIZABLE_KEY).toInstance(new DummyOrderProcessor());

Before is a binding annotation. In another module, bind CUSTOMIZABLE_KEY again, and we can override it:

bind(CUSTOMIZABLE_KEY).to(getLazyInstance(DbOrderProcessor.class));

Anemic Domain Model

2008-05-09T16:50:00.010+08:00

Martin wrote a blog post a long time ago: http://www.martinfowler.com/bliki/AnemicDomainModel.html. It was about domain models without rich behavior (anemic). Today, I am going to analyze why we have this problem, and try to give an elegant solution. Let's start with an example. This is a task management system with two entities in the domain, Employee and Task. We can write the relationship as the following code:

public class Employee {
    private Set<Task> tasks = new HashSet<Task>();
}

public class Task {
    private String name;
    private Employee owner;
    private Date startTime;
    private Date endTime;
}

It is a very typical parent/child relationship. Now, I want to add a behavior to my domain model: get all the processing tasks owned by a specified employee. If we ignore the existence of the database, this behavior very naturally belongs to the Employee entity.

public class Employee {
    private Set<Task> tasks = new HashSet<Task>();
    public Set<Task> getProcessingTasks() {
       ...
    }
}

But if we do care about the database, this design is not acceptable. Where do I get all my tasks from? Are you going to load all my tasks when building the employee object? If we only have five tasks, that is OK. But if we have 5000 tasks, that is probably not acceptable. So, before the age of Hibernate, we wrote:

public class TaskDAO {
   public Set<Task> getProcessingTasks(Employee employee) {
      ...//sql
   }
}

Hmmm, wait a moment... Is the DAO part of the domain model? Yeah... it can be. Just rename it to TaskRepository, and it is part of your domain model. Really? I don't believe it. The DAO is not part of your domain model. Instead, it stole the logic from the domain. That is the reason why our domain model is anemic: getProcessingTasks was part of Employee, but now it belongs to a DAO. Can Hibernate solve the problem?

@Entity
public class Employee {
    @OneToMany
    private Set<Task> tasks = new HashSet<Task>();
    public Set<Task> getProcessingTasks() {
       ...
    }
}

Yes! Hibernate rocks! Have we succeeded? No, not yet. Hibernate can make the tasks lazy-loaded, but you only have two options: load, or not. If you iterate over tasks inside the implementation of getProcessingTasks, you still end up loading all the tasks from the database. To solve this problem, many people tried many different ways. The goal was "injecting something" into the domain, so that the domain can execute the query itself. The attempts included Hibernate interceptors, static code instrumentation, AspectJ... Spring gave an answer to this:

@Entity
@Configurable
public class Employee {
    private TaskDAO dao;
    public Set<Task> getProcessingTasks() {
        return dao.getProcessingTasks(this);
    }
    public void setTaskDao(TaskDAO dao) {
        this.dao = dao;
    }
}

The @Configurable annotation was introduced to inject the DAO into the domain model. Now the domain can do what it is supposed to do. Really? A domain model depending on a DAO made lots of people unhappy. They argued about the cyclic dependency between the DAO layer and the domain layer. They argued that the domain should not be "bound" to the database or any container. Personally, I think it is not that big an issue... RoR Active Record binds the domain model to the database, and people still love it. Anyway, I started again, looking for a more elegant solution. Finally I found one. What if I wrote this:

public class Employee {
    private RichSet<Task> tasks = new DefaultRichSet<Task>();
    public RichSet<Task> getProcessingTasks() {
        return tasks.find("startTime").le(new Date()).find("endTime").isNull();
    }
...
}

RichSet is a Set with extra capabilities (query, sum...)

public interface RichSet<T> extends Set<T> {
    Finder<RichSet<T>> find(String expression);
    int sum(String expression);
}

DefaultRichSet is a pure in-memory implementation of those operations that works by iterating the set. So you can new up an Employee in your unit test, and test getProcessingTasks right away. No need to worry about the database or dependency injection. Do you feel better? But where is the database? Er... This is complicated, you know. The first thing I need to do is map the entity in Hibernate. Er... Hibernate does not like it. Hibernate expects a Set, not a RichSet. I think I need to write more things to make Hibernate happy:

<hibernate-mapping default-access="field" package="net.sf.ferrum.example.domain">
    <class name="Employee">
        <tuplizer entity-mode="pojo" class="net.sf.ferrum.RichEntityTuplizer"/>
        <id name="id">
            <generator class="native"/>
        </id>
        <set name="tasks" cascade="all" inverse="true" lazy="true">
            <key/>
            <one-to-many class="Task" />
        </set>
    </class>
</hibernate-mapping>

What is a tuplizer? Hibernate uses it to replace your set with its own enhanced set. So I wrote my own tuplizer, which replaces your set with my enhanced set.

public class RichEntityTuplizer extends PojoEntityTuplizer {
    public RichEntityTuplizer(EntityMetamodel entityMetamodel, PersistentClass mappedEntity) {
        super(entityMetamodel, mappedEntity);
    }

    protected Setter buildPropertySetter(final Property mappedProperty, PersistentClass mappedEntity) {
        final Setter setter = super.buildPropertySetter(mappedProperty, mappedEntity);
        if (!(mappedProperty.getValue() instanceof org.hibernate.mapping.Set)) {
            return setter;
        }
        return new Setter() {
            public void set(Object target, Object value, SessionFactoryImplementor factory) throws HibernateException {
                Object wrappedValue = value;
                if (value instanceof Set) {
                    HibernateRepository repository = new HibernateRepository();
                    repository.setSessionFactory(factory);
                    wrappedValue = new HibernateRichSet((Set) value, repository, getCriteria(mappedProperty, target));
                }
                setter.set(target, wrappedValue, factory);
            }

            public String getMethodName() {
                return setter.getMethodName();
            }

            public Method getMethod() {
                return setter.getMethod();
            }
        };
    }
}

In short, the code means:

employee.tasks = new HibernateRichSet<Task>(...)

This version of RichSet is much smarter. It will translate your find statements from

tasks.find("startTime").le(new Date()).find("endTime").isNull();

--->

DetachedCriteria.forClass(..).add(...).add(...)

Now, in the domain, you can query against your collection without worrying about how the query will be done. The domain is still pure, with no dependency on a DAO. The domain is still all in-memory: no need to start up your container or your database to test domain logic.
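To make the in-memory side concrete, here is a rough sketch of what a DefaultRichSet could look like. It is not the real implementation: the Finder shape (only le and isNull), the reflection-based field lookup, and passing "now" into getProcessingTasks (so the test is deterministic) are all assumptions of mine, and sum(...) is left out.

```java
import java.lang.reflect.Field;
import java.util.Date;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

final class RichSetSketch {
    /** Hypothetical shape of Finder: one comparison on one field. */
    interface Finder<R> {
        R le(Comparable<?> bound);
        R isNull();
    }

    /** Reduced RichSet: only find(...), no sum(...). */
    interface RichSet<T> extends Set<T> {
        Finder<RichSet<T>> find(String fieldName);
    }

    /** Pure in-memory implementation: every query iterates the set. */
    static class DefaultRichSet<T> extends HashSet<T> implements RichSet<T> {
        public Finder<RichSet<T>> find(final String fieldName) {
            return new Finder<RichSet<T>>() {
                public RichSet<T> le(Comparable<?> bound) {
                    return keep(value -> value != null && compare(value, bound) <= 0);
                }

                public RichSet<T> isNull() {
                    return keep(value -> value == null);
                }

                // Copies the elements whose field value satisfies the condition.
                private RichSet<T> keep(Predicate<Object> condition) {
                    DefaultRichSet<T> result = new DefaultRichSet<T>();
                    for (T element : DefaultRichSet.this) {
                        if (condition.test(readField(element, fieldName))) {
                            result.add(element);
                        }
                    }
                    return result;
                }
            };
        }
    }

    @SuppressWarnings("unchecked")
    static int compare(Object value, Comparable<?> bound) {
        return ((Comparable<Object>) value).compareTo(bound);
    }

    /** Reads a declared field by name via reflection. */
    static Object readField(Object target, String fieldName) {
        try {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true);
            return field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot read field " + fieldName, e);
        }
    }

    static class Task {
        final String name;
        final Date startTime;
        final Date endTime;

        Task(String name, Date startTime, Date endTime) {
            this.name = name;
            this.startTime = startTime;
            this.endTime = endTime;
        }
    }

    static class Employee {
        final RichSet<Task> tasks = new DefaultRichSet<Task>();

        // "now" is a parameter here only to keep the example deterministic.
        RichSet<Task> getProcessingTasks(Date now) {
            return tasks.find("startTime").le(now).find("endTime").isNull();
        }
    }
}
```

An Employee built this way can be tested with nothing but new, which is the whole point: the Hibernate-backed set only has to honour the same find(...) contract.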

Pain Points of Using XAML or WPF

2008-01-16T16:36:00.002+08:00

Pain Point 1: XAML always creates controls through their default constructor

This means you need a default constructor for your control, and XAML will always use that constructor. So you can not use constructor dependency injection to pass things like services and gateways to your control. You also have no chance to pass data in the constructor, even when that data is a must-have for the specific type of control.

Pain Point 2: Can not tell XAML whether or not to create some part of the GUI

Sometimes the GUI is not static. It can be dynamic because the GUI differs depending on the data it presents: for a meeting in the past it should show an add-note button, for a meeting in the future it should not. Even more often, security control requires the GUI to differ according to the role.

Pain Point 3: XAML uses XML, which contains too much visual noise

Compared to things like YAML, XML is definitely not very friendly to our eyes. The only thing worse than XML that I can come up with is the parentheses of Lisp. XML is also harder to edit manually.

Pain Point 4: Layout in Grid

Using grid layout currently requires you to specify the row and column for every child of a grid. This is very error-prone when the grid becomes large. But the grid is a must-have for any non-trivial GUI, and there is no replacement for it yet.

Pain Point 5: Things not checked at compile time

Lots of things in XAML are not checked by the compiler, such as bindings and resource lookups. It is also harder to cross-reference between XAML and code.

Pain Point 6: More files

One file for the XAML, one file for the .cs. It takes more steps to create a new user control, and it is confusing to newcomers.

Pain Point 7: Separating concerns

By default, events get handled in the partial class behind the XAML. This is not a good way of separating concerns and not good OO design. Windows and user controls usually do too much in a rich client application. It is not the fault of XAML in general, but XAML does not promote a good model either, with its weird way of hooking up events.

Pain Point 8: Hard to test

It is hard to test in many ways. First, because dependencies are not easy to inject, you can not mock expensive things like network connections. Second, creating a real window takes more than ten seconds. Third, many things follow a static singleton model, like resource lookup and the single-instance application object.

Pain Point 9: Controls are lazily created with an uncertain lifecycle

The controls for list items in a list view are lazily created. We can not get those controls easily, and we can not even be sure whether they have been created. Only those that are visible will be created.

Why Do We Need Mock Framework?

2007-05-11T13:12:00.001+08:00

Given we have a simple behavior to test:

a form
a text field on the form
a button on the form
click the button, should set the text of the text field to "Hello"

And here is the code implementing the behavior, with the MVP pattern applied:

public interface View {
  public void setText(String text);
  public void addActionListener(ActionListener actionListener);
}

public class Presenter {
  public Presenter(final View view) {
    view.addActionListener(new ActionListener() {
      public void actionPerformed() {
        view.setText("Hello");
      }
    });
  }
}

Before writing the test in java, let's first write in pseudo-code:

create mock view
create presenter by mock view
fire event on mock view
assert text is set

Then, let's implement it using latest jMock:

@Test
public void test_click_button_should_set_text_hello() {
  Mockery mockery = new Mockery();
  final View mockView = mockery.mock(View.class);
  final ActionListenerMatcher actionListenerMatcher = new ActionListenerMatcher();
  mockery.checking(new Expectations() {
    {
      one(mockView).addActionListener(with(actionListenerMatcher));
      one(mockView).setText("Hello");
    }
  });
  new Presenter(mockView);
  actionListenerMatcher.fireActionPerformed();
  mockery.assertIsSatisfied();
}

Here, we introduced a custom matcher called ActionListenerMatcher. We need it because we need a way to fire the event: without the matcher, we have no place to store the listener passed in. Here is the implementation of ActionListenerMatcher:

public class ActionListenerMatcher extends BaseMatcher<ActionListener> {
  private ActionListener actionListener;
  public boolean matches(Object item) {
    actionListener = (ActionListener) item;
    return true;
  }
  public void fireActionPerformed() {
    actionListener.actionPerformed();
  }
  public void describeTo(Description description) {
  }
}

What is the conclusion? The developer's intention when writing the test is lost in the long and complex mocking code. How about other frameworks? I have tried EasyMock as well, which is even worse than jMock. Do we have a simpler way? Yes, we have. Check this out:

@Test
public void test_click_button_should_set_text_hello() {
  MockView mockView = new MockView();
  new Presenter(mockView);
  mockView.fireActionPerformed();
  Assert.assertEquals("Hello", mockView.getText());
}

Isn't this simple? MockView is just a simple implementation of View:

private class MockView implements View {
  private ActionListener actionListener;
  private String text;
  public void addActionListener(ActionListener actionListener) {
    this.actionListener = actionListener;
  }
  public void setText(String text) {
    this.text = text;
  }
  public String getText() {
    return text;
  }
  public void fireActionPerformed() {
    actionListener.actionPerformed();
  }
}

So, before you start to use a mock framework, think about it: do we really need one?

Async Unit Testing in ActionScript3

2006-08-10T15:08:00.002+08:00

This is a cutting-edge topic; if you are not interested in programming next-generation Flash applications, please ignore me :) The most mature unit testing tool available in the ActionScript3/Flex2 world is FlexUnit2, an OSS project hosted at labs.adobe.com. It mirrors JUnit, from framework design to usage. Nothing changed, except that Thread.sleep is missing in the Flash world. If we can not wait for a few seconds, how do we test async behavior? To solve this problem, FlexUnit introduces a method in class TestCase called "addAsync". It takes at least two parameters. You can use it like this:

  loader.addEventListener(Event.COMPLETE, addAsync(onIndexPageLoaded, 1000));

The value returned by addAsync is a function wrapping your event handler. To look up full documentation about this method, see here. After adding an "Async", FlexUnit will wait for a while before it finishes that test method. But here are some findings and tips for you:

FlexUnit is using "Timer" to wait for finishing. It will not pause execution actually, but will check for the result later. i.e:
```
  loader = new URLLoader();
  loader.addEventListener(Event.COMPLETE, addAsync(onIndexPageLoaded, 1000));
  loader.load(new URLRequest("twspike-index.html"));
  doSomething();
```
doSomething will be executed immediately after loader.load(...).
You can not use "addAsync" in setUp. Because setUp and testXXX are treated as two different test cases, FlexUnit will wait for setUp to finish instead of waiting for your actual testing code to finish. Currently, the error reported is quite mysterious.

How do we actually wait for something to happen and then start testing? Here is the home-made HOW-TO:

private function onIndexPageLoaded(event:Event):void {
  parser = new IndexPageParser(loader.data);
  checker.call(this);
}
private function check(checker:Function):void {
  this.checker = checker;
  loader = new URLLoader();
  loader.addEventListener(Event.COMPLETE, addAsync(onIndexPageLoaded, 1000));
  loader.load(new URLRequest("twspike-index.html"));
}
public function blog_data_should_not_be_null():void {
  check(function():void {
    assertNotNull(parser.blogData);
  });
}
public function blog_data_should_be_valid_xml():void {
  check(function():void {
    XML(parser.blogData);
  });
}

To sum up: wrap your testing code in a function and pass it to a check method, which saves the testing code and starts the async action. In the event handler, call the saved testing code. It is annoying, I know...

Simple Design: "DSL" Reloaded

2006-06-30T09:40:00.000+08:00

I have been silent about DSL for a while. Now, I am back :) After thinking for several months, I realized that most of the time people don't need a domain specific language; they just need the code to read more nicely. Then I came up with the idea that this kind of requirement doesn't need to involve heavy implementation such as a grammar, parser or compiler. It could be simple, and it should be simple! So, is Ruby or Smalltalk the right choice? I have to say, I don't think so. The reason I am not very keen on using Ruby as the environment to embed a so-called "DSL" (I still refer to nice-looking code as DSL, sorry about that) is that the language was not invented to host DSLs. Method missing, closures and initial blocks were not intended to be used this way:

publishing agreement dated '9/20/2005'
with_author 'Joe W. Author', social('555-493-3920')
for_title 'DSLs for Dummies'

report do
  calculate 'Royalties', as net_retail_sales.during(last_six_months) * 20.percent
end

We are using Ruby in too tricky a way! We are not using it, we are hacking it. The side effect is that understanding the inner mechanism behind the nice code becomes harder and harder, which leads us in a dangerous direction. That reminds me of a similar experience with C++. After templates were introduced into C++, I think that apart from the STL and several other excellent frameworks addressing critical issues (mostly performance), the others are simply too smart to be useful. Tons of frameworks inside Boost are just trying to make the code look nicer... My point is: if the language does not support the way of writing code we want, don't hack it with powerful tricks, even if the father of the language encourages you to do so. (I don't know the attitude of Matsumoto, but I do know Bjarne speaks a lot about extending C++ using frameworks.)

If not by hacking a flexible scripting language, what can we choose to implement the so-called DSL? The one thing I am sure of is that if we need to write a complex grammar for a new DSL (actually an English-like language), we are going in the wrong direction, because human language is too complex to be handled by a formal grammar specification. So, I like the philosophy behind embedded DSLs (Ruby again... or Smalltalk). The DSL is still embedded in a GPL, but the GPL should support hosting a DSL so that the implementation doesn't need to be tricky.

The initial idea came to my mind back in February. But at that time, I thought what we needed was a new lightweight GPL, still a weakly typed scripting language with modern features, just like Python or Ruby. Part of the reason is that I had another nice idea about how to implement a weakly typed scripting language efficiently on the JVM, but I didn't have the passion to carry it into reality. This led me to a not-that-simple design... Sadly or luckily, I have to say, after several months I realized it should be simpler than a new scripting language. Then, what is the simplest design?
How about this:

assertThat(characterSet, contains('a'));

compared with

assert_$1$_contains_$2$(characterSet, 'a');

then formatted to:

assert characterSet contains a

What we need is an Eclipse plugin to write and read Java files in a different view. Furthermore, we can have Eclipse display an inner class like a closure. And $this$ could also be part of the method name, to support:

list.add(item);
list.add_$1$_to_$this$(item);
add item to list
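The formatting step that such a plugin would perform can be sketched in a few lines. This is only my guess at the rule: split the method name on underscores, replace $n$ with the n-th argument text and $this$ with the receiver text. The class and method names are made up for the example.

```java
final class ReadableCallFormatter {
    /**
     * Turns a method name like "add_$1$_to_$this$" plus the source text of
     * the receiver and arguments into a sentence-like string.
     */
    static String format(String methodName, String receiver, String... args) {
        StringBuilder sentence = new StringBuilder();
        for (String token : methodName.split("_")) {
            if (token.isEmpty()) {
                continue; // tolerate doubled underscores
            }
            String word = token;
            if (token.equals("$this$")) {
                word = receiver;
            } else if (token.matches("\\$\\d+\\$")) {
                // $1$ refers to the first argument, $2$ to the second, ...
                int position = Integer.parseInt(token.substring(1, token.length() - 1));
                word = args[position - 1];
            }
            if (sentence.length() > 0) {
                sentence.append(' ');
            }
            sentence.append(word);
        }
        return sentence.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("assert_$1$_contains_$2$", null, "characterSet", "a"));
        System.out.println(format("add_$1$_to_$this$", "list", "item"));
    }
}
```

A real plugin would work on the parsed Java source rather than on strings, but the mapping from method name to sentence stays this small.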

Integrating Selenium With Build Script

2006-05-24T16:03:00.002+08:00

[Information Below is about Selenium-Core only] When I was at TWU, Michael spent three days putting Selenium tests into the build script. I didn't even know whether cc.net was actually running the Selenium tests in the end. Since then, I have known this is not an easy problem. Today was my first time trying to set up a cc that runs Selenium tests. It cost nearly a whole day of fighting with Resin and JDK bugs. Fortunately, I won the game at last :) Here are some tips I want to share. Steps:

Start server
Run Tests
Get Result
Stop server

The key problem is "how to get the result". Selenium will call a URL "postResult" after all tests are finished. The official solution to catch the result is writing a servlet. But two problems need to be taken into consideration: 1. When to stop the server? 2. How to know the tests passed? To solve these two problems, the servlet needs to "provide" information about the testing progress and result. So on one side, the servlet is a result receiver for Selenium; on the other side, it is the provider of Selenium testing information for the build script. Here is my code:

  
public class SeleniumResultServelet extends HttpServlet {
 
 private volatile String result = null;
 
 protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
  OutputStream outputStream = response.getOutputStream();
  if (result == null) {
   outputStream.write("pending".getBytes());
  } else {
   outputStream.write(result.getBytes());
  }
  outputStream.write("\r\n".getBytes());
 }

 public void doPost(HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException {
  result = request.getParameter("result");
 }

}

ps: Selenium uses "POST" to access the postResult URL, the build script uses "GET" to access it. Inside the build script, there is a loop:

while (true) {
    Thread.sleep(500);
    URL url = new URL(postResultURL);
    BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
    String line;
    try {
        line = reader.readLine();
    } finally {
        reader.close();
    }
    // readLine() returns null on an empty response; treat that as still pending
    String status = (line == null) ? "pending" : line.trim();
    if ("failed".equals(status)) {
        throw new RuntimeException("Selenium Test Failed");
    }
    if (!"pending".equals(status)) {
        break;
    }
}

Above is my immature way to integrate Selenium... ps: the reason why I wrote this part of the build script in Java is not only that I need to integrate Selenium, but mainly that Resin can not be stopped by Ant under Windows (I tried a Windows service, but failed). So please don't blame me for that...
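One weakness of the loop above is that it waits forever if Selenium never posts a result. A possible variation, sketched here under my own assumptions, adds a deadline and takes the status from a Supplier so the polling logic can be tried without a server. SeleniumResultPoller and waitForResult are hypothetical names, and the timeout behavior is not part of the original script.

```java
import java.util.function.Supplier;

final class SeleniumResultPoller {
    /**
     * Asks statusSource for the status until it is no longer "pending",
     * or gives up once timeoutMillis has elapsed.
     */
    static String waitForResult(Supplier<String> statusSource, long timeoutMillis, long intervalMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            String line = statusSource.get();
            String status = (line == null) ? "pending" : line.trim();
            if (!"pending".equals(status)) {
                return status; // the caller decides what "failed" means for the build
            }
            if (System.currentTimeMillis() >= deadline) {
                throw new IllegalStateException("No Selenium result after " + timeoutMillis + " ms");
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for Selenium result", e);
            }
        }
    }
}
```

In the build script, the supplier would wrap the same GET request on the postResult URL that the loop above performs, and the script would fail the build when the returned status is "failed".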

Office could be a perfect platform for DSL development

2006-04-05T17:15:00.000+08:00

I have been thinking a lot about DSL recently, and I noticed we are doing quick starts using PowerPoint. I suddenly realized Office is a perfect platform for involving business people in software development. After some investigation and research, I found Office is not only a user-friendly platform, but also quite extensible for third parties, especially the upcoming 2007 version. The detailed implementation technology still needs more research, but I am sure it will be flexible enough to host a DSL development environment. My vision is: Word is the code editing place. You can write, edit, run and debug DSL code in it, including inventing new DSLs and using existing ones. Excel is the testing place. Anyone who has used FitNesse will find it suitable for being implemented in Excel. PowerPoint is the GUI designing place, not only for single pages, but also for the flow between pages. At the large scale, Office is a platform used for authoring documents. Since a DSL is a kind of executable document, there is no reason not to utilize the de facto document editing platform - Office.

Investigation on DSL

2006-03-15T03:34:00.000+08:00

Choose the one most important/applicable to you from each level:

Level 1
Productivity
Correctness
Flexibility

Level 2
Involving Non-Programmers
Reuse

Level 3
Higher-Level Abstraction
Readability

Level 4
Non-Textual Source Code (Advertised by Intentional Programming, MPS)
Meta-Programming
Visualization (A Step Further from Non-Textual Source Code, Using Table, Diagram...)
Good Looking Code Producing Syntax (Method Missing, Keyword Message, Lambda/Blocks/Closure...)

Each level represents a goal people seek in DSL research. A higher-level goal is supported by lower-level goals. I want to know, in each level, which goal matters most in Business Solution Development. Thank you very much for participating in this investigation.

Thoughts on Implementing Dynamic Typed Object in JVM

2006-03-06T21:58:00.000+08:00

Java is a statically typed language. The JVM is a statically typed virtual machine. How do we implement dynamically typed objects on the JVM? I just got an idea: generate an interface for each method. When you call a method, you first cast the object to the interface which supports the method being called. We might end up needing to implement hundreds of interfaces for one object. Yep, that is it: using interfaces to work around the static typing limit. The side effects might be:

hard to implement Mixin
hard to implement method missing
classloader will go crazy...

But I think this way will give better performance and a better interop experience, because we are actually using the Java object model instead of building a new object model out of maps.
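Written out by hand, the idea could look roughly like the sketch below. In a real implementation the interfaces and call sites would be generated by a compiler; the names (Responds_speak, callSpeak and so on) are invented for this example.

```java
final class DynamicCallSketch {
    /** One interface per method name; a compiler would generate these. */
    interface Responds_speak {
        Object speak();
    }

    interface Responds_plus {
        Object plus(Object other);
    }

    static final class Dog implements Responds_speak {
        public Object speak() {
            return "woof";
        }
    }

    static final class Num implements Responds_plus, Responds_speak {
        final int value;

        Num(int value) {
            this.value = value;
        }

        public Object plus(Object other) {
            return new Num(value + ((Num) other).value);
        }

        public Object speak() {
            return Integer.toString(value);
        }
    }

    /** What a dynamically typed call site "receiver.speak()" could compile to. */
    static Object callSpeak(Object receiver) {
        if (receiver instanceof Responds_speak) {
            return ((Responds_speak) receiver).speak();
        }
        throw new UnsupportedOperationException("method missing: speak");
    }

    /** Call site for "receiver.plus(argument)". */
    static Object callPlus(Object receiver, Object argument) {
        if (receiver instanceof Responds_plus) {
            return ((Responds_plus) receiver).plus(argument);
        }
        throw new UnsupportedOperationException("method missing: plus");
    }
}
```

The call itself is an ordinary interface invocation, which is where the hoped-for performance and interop come from. The failed-cast branch also shows why method missing is awkward: a fallback has to be wired in at every call site.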

Suggesting a new Language for Inventing Domain Specific Language

2006-03-04T01:26:00.000+08:00

Inspired by Roy's recent speech given to TWU&TWI and Vincent's nice introduction to Smalltalk, I am thinking about the possibility of inventing a language for creating DSLs more easily. Inventing a DSL (External DSL in Martin’s definition) is not an easy job. We have to write the grammar in EBNF, and write a compiler or interpreter for it. It takes a long time and major effort to see it work. So the decision of whether to make a DSL must be made very carefully; it may not be worth the effort. And after it is born, changing the syntax or grammar is another big problem.

Thinking about the process of making a DSL, we find we are given too many options. Writing a compiler from scratch, we have thousands of choices for the grammar, syntax and semantics of the language. Do we need so much flexibility, at the cost of leaving out so many basic and nice language elements (like OOP or GC)?

Why do we need a new language to cover a specific domain? Maybe the reasons fall into two categories:

1. Improving Productivity (Better Encapsulation, and Reuse)
2. Improving Expressiveness (We can communicate with the clients in a better way)

For the first reason, I don’t think a DSL can be very useful. The art of organizing code is a hot research area, but I haven't seen any new technology improve productivity much since the birth of OOP. I believe OOP will remain a reliable technology for building large business applications for a long time, especially when complemented with Agile. For the second reason, I found that without inventing a new language from scratch we can still achieve the same goal, by building the DSL on top of an extremely flexible language (Internal DSL in Martin’s definition).

Assume the main reason to invent a DSL is Improving Expressiveness. What hurts the expressiveness of code? I think it is because:

1. The concepts employed in the code don't fit the real world very well.
2. The grammar or syntax of the programming language makes the code look weird.
For the first problem, we can solve it using current technologies such as Object Oriented Programming, Domain Driven Design… For the second problem, we have few choices when we are using languages like Java or C#. If using Ruby, because the language provides a bunch of nice features, we are able to do some clever design to make the code look better, but we are still quite limited. If using Lisp or Smalltalk, we are given maximum flexibility to do nice things. The features of Smalltalk which make its code look so natural are:

1. Minimal built-in keywords, few things are special (Even if/else, while are implemented using OOP or Recursion)
2. More straightforward grammar (No commas, braces, curly braces)
3. Keyword Message

Taking if/else as an example, “ifTrue” is just a message of class True and class False. The difference between the two implementations is that one executes the following block and the other just ignores it. So we can easily introduce new control structures and other things that used to be implemented at the language level.

Keyword Message is another cool feature. In Java, we can only write: text.addAttributes(attr, start, stop). But in Smalltalk, we can give the message addAttribute a better name that carries information about the parameters:

text addAttribute: attr from: start to: stop

The signature of the message consists of three parts: addAttribute, from, to. So the code reads more like English. But Smalltalk is not good enough:

1. Still have some symbols for grammar like [:] [^].
2. The arrangement of sentence elements is limited by the order “object message”.
3. Can not use spaces in naming.
4. Can not involve the left side of = in the sentence nicely (we can not write Create new person, assign it to Michael).

So I am suggesting a new language whose only purpose is creating new DSLs more easily. It starts from the good job done by Smalltalk, and improves on it by solving the above problems.
The final target is to allow the code to read like English, although writing it may still need much more careful design, naming and coding compared with writing casually in English. The initial thoughts are listed below:

1. Use XML to structure the source code (actually the abstract syntax tree), and program against a GUI representation of the source code instead of writing in a text file directly. (So I don’t need to invent any fancy grammar to keep the balance of exactness and expressiveness. Let the XML source file handle the problem of exactness by documenting the parse tree, while maintaining high expressiveness through the IDE.)
2. Decouple order from invocation. (By introducing the above technique, we can now specify which object a message is sent to through the underlying XML representation. Then it is not necessary to force the object to be followed by the message sent to it. Finally we can say “update window” instead of “window update”.)
3. Out parameter. (x = y calcSomething; becomes y calcSomething assign to x. Pass the variable in which you want to store the return value as a special message parameter to the object. Then the semantics of “=” can be shown.)
4. The message then becomes the skeleton of a sentence; we can fill the object we want to operate on, the arguments of the operation, and the result variable into the skeleton to form part of a sentence. (move … to … , … is a message, so we can say “move pointA to 1 , 2”. The object we are operating on is pointA, and the arguments are 1 and 2.)

I think creating a DSL this way is the most economical way. Either writing a new compiler or trying to use MDA to save the effort spent on writing a compiler will cost a lot of time, money and rework. But when building a DSL upon a flexible OOP language, after some interfacing work, the DSL is just a natural extension of the Domain Model we have built today.

BTW: Is “SmallRocks!” a nice name? :)
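The remark above that “ifTrue” is just a message of class True and class False can be imitated in Java to see how little is needed. This is only an illustration of the Smalltalk idea, not how Smalltalk itself is implemented; Bool, TRUE and FALSE are names I made up.

```java
final class SmalltalkBooleanSketch {
    /** A boolean that answers control-flow "messages" instead of using if/else. */
    interface Bool {
        void ifTrue(Runnable block);
        void ifFalse(Runnable block);
    }

    /** The "True" object: runs the block for ifTrue, ignores it for ifFalse. */
    static final Bool TRUE = new Bool() {
        public void ifTrue(Runnable block) {
            block.run();
        }

        public void ifFalse(Runnable block) {
            // ignore the block
        }
    };

    /** The "False" object: the mirror image of TRUE. */
    static final Bool FALSE = new Bool() {
        public void ifTrue(Runnable block) {
            // ignore the block
        }

        public void ifFalse(Runnable block) {
            block.run();
        }
    };
}
```

Neither implementation contains an if statement; the choice is made by which object receives the message. New control structures can be added the same way, as plain methods.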

Why object matters in agile development?

2006-02-07T22:48:00.000+08:00

I asked this question to myself and to my colleagues here (Xi'an). I got a pretty cool answer from Vincent, and it seems to be correct: "Because using object technology, we can make local changes more easily than with the old-time procedure-based method." I totally agree with him. Agile is all about embracing change. We have to refactor all the time. If we code in the procedure-based way, we find it very hard to change a function without affecting other functions. But if we utilize object technology, we have design patterns and other experience to decouple objects. Then we are likely to refactor more, and more effectively. I will ask the same question at TWU, and want to hear different ideas~

I Don't Quite Agree with the Book "object technology"

2006-02-07T09:30:00.000+08:00

I am reading the textbooks we will use at TWU. One of them is "object technology: a manager's guide". It is a great book that tries to explain why objects matter to business issues. But I don't quite agree with the book after reading the first chapter, so I will keep updating my view as I move through the rest. The reason I am not convinced is that I don't think objects in programming should be described as being just like objects in nature. In my own experience, they are quite different. In my opinion, the objects in the programming world are projections of the objects in the real world, bounded by the context of the problem. If we treat the two kinds of objects as the same, we tend to make one object really complex in order to cover all the roles it could possibly play in nature, which is not exactly the way we program. Instead, we should set the problem context first, then analyse the objects inside the context, and then design objects to reflect the real world. So, I think "object technology" is not quite right.