<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Andrew</title>
	<atom:link href="https://floodyberry.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://floodyberry.wordpress.com</link>
	<description>Just another WordPress.com weblog</description>
	<lastBuildDate>Thu, 27 Jun 2013 21:01:18 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='floodyberry.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>https://s0.wp.com/i/buttonw-com.png</url>
		<title>Andrew</title>
		<link>https://floodyberry.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="https://floodyberry.wordpress.com/osd.xml" title="Andrew" />
	<atom:link rel='hub' href='https://floodyberry.wordpress.com/?pushpress=hub'/>
	<item>
		<title>High Performance C++ Profiling</title>
		<link>https://floodyberry.wordpress.com/2009/10/07/high-performance-cplusplus-profiling/</link>
					<comments>https://floodyberry.wordpress.com/2009/10/07/high-performance-cplusplus-profiling/#comments</comments>
		
		<dc:creator><![CDATA[floodyberry]]></dc:creator>
		<pubDate>Wed, 07 Oct 2009 15:40:44 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">http://floodyberry.wordpress.com/?p=50</guid>

					<description><![CDATA[[This is about my C++ Profiler which may be found on Google Code under High Performance C++ Profiler] My interest in code profiling started when I was making hudbot. What with code injection and patching, function hooking, data hijacking, and OpenGL, I knew I had relatively no experience in what I was attempting and that [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>[This is about my C++ Profiler which may be found on Google Code under <a href="http://code.google.com/p/high-performance-cplusplus-profiler/">High Performance C++ Profiler</a>]</p>
<p>My interest in code profiling started when I was making <a href="http://www.team5150.com/~andrew/project.hudbot/">hudbot</a>. What with code injection and patching, function hooking, data hijacking, and OpenGL, I knew I had relatively no experience in what I was attempting and that I could easily be producing some amazing slowdowns if I wasn&#8217;t careful.</p>
<p>Unfortunately, C++ profilers seem to come in three varieties, each of which has a fatal downside:</p>
<ol>
<li><a href="http://en.wikipedia.org/wiki/Profiling_%28computer_programming%29#Statistical_profilers">Sampling Profilers</a> which are <strong>fast</strong>, <strong>multi-threaded</strong>, but <strong>inaccurate</strong> and have <strong>decent output</strong> (sometimes too detailed).  Some examples are <a href="http://software.intel.com/en-us/intel-vtune/">VTune</a>, <a href="http://developer.amd.com/CPU/CODEANALYST/Pages/default.aspx">CodeAnalyst</a>, <a href="http://code.google.com/p/google-perftools/">google-perftools</a> and <a href="http://www.codersnotes.com/sleepy/">Sleepy</a>.</li>
<li><a href="http://en.wikipedia.org/wiki/Profiling_%28computer_programming%29#Instrumenting_profilers">Instrumenting Profilers</a> which are <strong>accurate</strong>, <strong>multi-threaded</strong>, but <strong>slow</strong>, and have <strong>decent output</strong>.  Some examples are <a href="http://www.glowcode.com/">GlowCode</a> and the now defunct DevPartner Profiler Community Edition.</li>
<li>Instrumenting Profilers which are <strong>fast</strong>, <strong>accurate</strong>, but <strong>single threaded</strong> and have <strong>limited output</strong>.  These range from extremely simple profilers like Peter Kankowski&#8217;s <a href="http://smallcode.weblogs.us/oldblog/2006/09/17/poor-man-profiler/">Poor Man&#8217;s Profiler</a> to the more complicated and full-featured <a href="http://sourceforge.net/projects/shinyprofiler/">Shiny C++ Profiler</a>.</li>
</ol>
<p>The obvious outcome is that if you want <strong>fast</strong> and <strong>accurate</strong>, like I did, you&#8217;ll have to use an existing instrumenting profiler or write one yourself, and instrument your code manually. With a little work, fancy stuff like call trees can be added. Once you get it tested and working, you can start going crazy profi<em>Segmentation fault</em>.</p>
<p>Oh yeah, about that. There are no open source multi-threaded instrumenting profilers, and depending on how your single threaded profiler works, the results of using it in a multi-threaded environment can range from bad data to outright crashing. It&#8217;s possible to patch the profiler to only allow the main thread in, but this adds unnecessary slowdowns and doesn&#8217;t address how to profile the other threads. This is where my profiler comes in!<br />
<span id="more-50"></span></p>
<h4>Pieces of a high performance multi-threaded C++ profiler</h4>
<h5>Timing</h5>
<div data-shortcode="caption" id="attachment_77" style="width: 346px" class="wp-caption alignright"><img aria-describedby="caption-attachment-77" data-attachment-id="77" data-permalink="https://floodyberry.wordpress.com/2009/10/07/high-performance-cplusplus-profiling/timingmethods/" data-orig-file="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/timingmethods2.png" data-orig-size="336,245" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Timing Methods" data-image-description="" data-image-caption="" data-medium-file="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/timingmethods2.png?w=300" data-large-file="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/timingmethods2.png?w=336" class="size-full wp-image-77       " title="Timing Methods" src="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/timingmethods2.png?w=336&#038;h=245" alt="Timing Methods" width="336" height="245" srcset="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/timingmethods2.png 336w, https://floodyberry.wordpress.com/wp-content/uploads/2009/09/timingmethods2.png?w=150&amp;h=109 150w, https://floodyberry.wordpress.com/wp-content/uploads/2009/09/timingmethods2.png?w=300&amp;h=219 300w" sizes="(max-width: 336px) 100vw, 336px" /><p id="caption-attachment-77" class="wp-caption-text">Latency in cycles and resolution of various timing methods (resolution is hand wavy, not to scale)</p></div>
<p>The main piece of a high performance profiler is the mechanism used to get the timestamps. High precision is the obvious first requirement, but the mechanism must also have as low a latency as possible. If you&#8217;re making millions of calls a second to your profiler, the timestamp mechanism can become the limiting factor in your app&#8217;s performance and make it so unresponsive that testing it is infeasible.</p>
<p>On x86, this means you must go with <strong><a href="http://en.wikipedia.org/wiki/Time_Stamp_Counter">rdtsc</a></strong>. It is low latency, high precision, and portable to gcc. The choice is unfortunately not without its trade-offs. rdtsc does not serialize, so unless you insert a serializing instruction like cpuid before it (bloating the latency in the process) or use the newer rdtscp instruction, the cycle count you receive may not be 100% accurate. rdtsc is also not guaranteed to be synchronized across all CPUs in a multi-core / multi-CPU system, so even single threaded timing can be incorrect if the thread is scheduled across multiple CPUs. <strong>But</strong>, and this is a big but, for what I want there is nothing else to use. Someone with different needs can replace the timer function, but for the volume of calls I&#8217;m interested in, latency needs to be the bare minimum.</p>
<h5>Multi-threading</h5>
<p>This is the part nobody seems to know or care how to do, so I was on my own. To avoid overhead, synchronization must be avoided at all costs on the general path.</p>
<p>Easy solution: Track each thread&#8217;s profiler state with <a href="http://en.wikipedia.org/wiki/Thread-local_storage">thread-local storage</a> and do the heavyweight statistics work on-demand instead of worrying about aggregating on the fly. A little more work needs to be done depending on the details of the profiler implementation, but nothing difficult.</p>
<p>I didn&#8217;t want anything complicated or OS specific for the synchronization, so I decided on a home-made <a href="http://en.wikipedia.org/wiki/Compare-and-swap">CAS</a> solution (i.e. lock xchg on the x86) for the <a href="http://en.wikipedia.org/wiki/Spinlock">spinlocks</a> and such.</p>
<p>The only change for the user is needing to explicitly enter &amp; exit threads for the profiler. Importantly, a thread that is not instrumented does not hurt the system in any way; it simply will not show up in the profile stats.</p>
<h5>Storage</h5>
<div data-shortcode="caption" id="attachment_64" style="width: 309px" class="wp-caption alignright"><img aria-describedby="caption-attachment-64" data-attachment-id="64" data-permalink="https://floodyberry.wordpress.com/2009/10/07/high-performance-cplusplus-profiling/calltree-2/" data-orig-file="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/calltree1.png" data-orig-size="299,202" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Call Tree" data-image-description="" data-image-caption="" data-medium-file="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/calltree1.png?w=299" data-large-file="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/calltree1.png?w=299" class="size-full wp-image-64 " title="Call Tree" src="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/calltree1.png?w=299&#038;h=202" alt="Call Tree" width="299" height="202" srcset="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/calltree1.png 299w, https://floodyberry.wordpress.com/wp-content/uploads/2009/09/calltree1.png?w=150&amp;h=101 150w" sizes="(max-width: 299px) 100vw, 299px" /><p id="caption-attachment-64" class="wp-caption-text">Example of a Call Tree</p></div>
<p>How the profiler data is stored is a function of how much detail is needed. In my case I wanted full call trees, which means storage has to be dynamically allocated and call nodes must be dynamically located. Obviously this requires a hash table, but it should not be a generic, fully functional hash table! Because of the limited scope of what the profiler needs, corners can be cut so that it does what it needs to, quickly.</p>
<p>Probably the easiest gain to be made is not hashing the function name when searching for it in the hash table! If you use compile time constant strings like <a href="http://gcc.gnu.org/onlinedocs/gcc/Function-Names.html">__FUNCTION__</a>, you can simply take the address of the string as the hash value and skip the O(n) hashing of the string. Small fixups may be needed (I use &gt;&gt; 5 on the address as the strings are possibly aligned), but in the end it&#8217;s a big win.</p>
<p>If using an <a href="http://en.wikipedia.org/wiki/Hash_table#Open_addressing">open hash table</a> like I am, there is never a need to delete individual entries, so you do not need a tombstone value for a deleted key or a check in your lookup against them. There&#8217;s either a value in a slot, or it&#8217;s empty.</p>
<p>I also use <a href="http://en.wikipedia.org/wiki/Linear_probing">linear probing</a> because the table should be empty enough, and the hash dispersion spread out enough, that the extra code for quadratic or better probing would slow the search down in the general case. I do not inline the elements in the table, however; the hash table is an array of pointers to individually created profiler nodes, with each node being its own hash table. The size of each node is nearly 64 bytes with 32 bit pointers, so I figured scanning them would produce cache misses regardless, and inlining them in the table was more trouble than it was worth.</p>
<h5>Synchronizing</h5>
<p>If the goal is to profile an entire run of a program, then you don&#8217;t need synchronization. Profile in to the threadlocal profilers, wait for all threads to finish, and combine the results. Easy!</p>
<p>My goal is more difficult. I want to not only dump the profiling results at any time, I want to be able to <em>reset</em> the results at any time. When I&#8217;m profiling games, I often want to profile a specific action or sequence of actions. I don&#8217;t want to include the startup time and whatever is required to get to that specific thing I want to profile, e.g. running a timedemo or <a href="https://floodyberry.wordpress.com/2008/07/09/the-id-tech-4-script-compiler-sucks-hard/">analyzing a horrible script compiler</a>.</p>
<p>So how to synchronize without slowing everything down, while still being able to crawl a potentially mutating hash table safely? I compromised by accepting that a few call nodes may get extra calls: I lock on any mutation to the hash table array, but <strong>not</strong> on mutations to the timing information. When the dump/reset code is using a thread&#8217;s hash table, it takes the thread&#8217;s lock, crawls its information, and releases it. The thread is still free to profile to existing nodes during this time, but any attempt to insert a new node will block.</p>
<p>Reset is handled not by deleting a call tree, but by crawling the tree and resetting each node&#8217;s timer to zero (threads which are no longer active are of course fully deleted). This has the byproduct of not only being able to reset all threads when they&#8217;re at arbitrary positions in their call trees, but also to call reset from arbitrary positions in any call tree. I ran into this need the hard way when profiling Tribes&#8217; script system: the call to reset the profiler is issued from a script command, so the reset happens from inside a call tree, and bad things happened. To keep reset-but-not-yet-active nodes out of dumps, it&#8217;s necessary to check that nodes have a non-zero time before walking them.</p>
<p>A  <em>small</em> caveat with reset: Because there is no synchronization for existing nodes, it&#8217;s possible that</p>
<pre class="brush: cpp; title: ; notranslate">ticks += ( getticks() - started )</pre>
<p>could execute at the same time the thread that is resetting everything executes</p>
<pre class="brush: cpp; title: ; notranslate">ticks = 0</pre>
<p>Now, if the code is generated so the value is summed directly into ticks, there is no issue: ticks will either get incremented then stomped, or stomped then incremented. This seems to be the case in both gcc and msvc release builds. But in debug builds, ticks can be read into a temporary variable, summed, and written back, which could stomp the 0 written by the reset thread. Since this isn&#8217;t an issue in release builds and will only happen on a freak chance in debug builds, I don&#8217;t see it being a problem, but it&#8217;s something to be aware of.</p>
<h5>Output</h5>
<p>A fully working profiler is still not useful if its output is incomplete or difficult to read. Look at the documentation for understanding gprof&#8217;s <a href="http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html#SEC6">call graph</a>: 5-6 pages of text for a glut of confusing numbers and diagrams, when a well-made call tree/graph doesn&#8217;t need documentation and can be understood with visual examination. Google&#8217;s <a href="http://google-perftools.googlecode.com/svn/trunk/doc/cpuprofile.html">perf-tools</a> require you to run the output through a perl script, and still the results are hideous and hard to read! Which is why I came up with:</p>
<div data-shortcode="caption" id="attachment_88" style="width: 210px" class="wp-caption aligncenter"><a href="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-ascii2.png"><img aria-describedby="caption-attachment-88" data-attachment-id="88" data-permalink="https://floodyberry.wordpress.com/2009/10/07/high-performance-cplusplus-profiling/profiler-ascii-small/" data-orig-file="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-ascii-small.png" data-orig-size="200,94" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="profiler-ascii-small" data-image-description="" data-image-caption="" data-medium-file="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-ascii-small.png?w=200" data-large-file="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-ascii-small.png?w=200" class="size-full wp-image-88 " title="ASCII Profiler Output" src="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-ascii-small.png?w=200&#038;h=94" alt="profiler-ascii-small" width="200" height="94" srcset="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-ascii-small.png 200w, https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-ascii-small.png?w=150&amp;h=71 150w" sizes="(max-width: 200px) 100vw, 200px" /></a><p id="caption-attachment-88" class="wp-caption-text">Profiler Dump - ASCII</p></div>
<p>I originally only had the ASCII dumper. For games (which use text consoles) or command line apps, text output is quick and easy. A real tree structure is rendered so you aren&#8217;t left playing guessing games as to how far each item is indented or trying to match up rows of numbers to their function names on the other side of the screen. Self time while inside the tree is only useful to that specific branch, so I also include the top callers overall by self time so global bottlenecks can be identified.</p>
<div data-shortcode="caption" id="attachment_91" style="width: 210px" class="wp-caption aligncenter"><a href="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-html1.png"><img aria-describedby="caption-attachment-91" loading="lazy" data-attachment-id="91" data-permalink="https://floodyberry.wordpress.com/2009/10/07/high-performance-cplusplus-profiling/profiler-html-small/" data-orig-file="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-html-small.png" data-orig-size="200,166" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="profiler-html-small" data-image-description="" data-image-caption="" data-medium-file="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-html-small.png?w=200" data-large-file="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-html-small.png?w=200" class="size-full wp-image-91   " title="profiler-html-small" src="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-html-small.png?w=200&#038;h=166" alt="profiler-html-small" width="200" height="166" srcset="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-html-small.png 200w, https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-html-small.png?w=150&amp;h=125 150w" sizes="(max-width: 200px) 100vw, 200px" /></a><p id="caption-attachment-91" class="wp-caption-text">Profiler Dump - HTML</p></div>
<p>Text is great, but I thought it would also be useful to have a fancier dump with things like colored hotspots and mouse hover highlighting. Archiving the generated HTML files is also an excellent way to track and show improvements over time.</p>
<p>As an aside, the HTML dump is actually the main source of the code bloat in the profiler now. I chose to integrate the generation of the HTML dump in the C++ code versus post-processing data with a script because every step between you and the final product is another reason to not use it at all.</p>
<h4>Performance</h4>
<p>To test performance on the type of applications I&#8217;m interested in profiling, I ran Thierry Berger-Perrin&#8217;s <a href="http://code.google.com/p/high-performance-cplusplus-profiler/source/browse/trunk/examples/example-sphereflake.cpp">sphereflake</a> app through a variety of profilers on my E8400. sphereflake was built for a single thread so Shiny would run, while my profiler (and everyone else) was running in multi-threaded mode. The only difference in the general case between single and multi-threaded mode in my profiler is whether the pointer to the local state is a) static, or b) a thread-local static. The overhead from the tls handling is negligible, so the profiler can be run in multi-threaded mode at all times without worry.</p>
<div data-shortcode="caption" id="attachment_108" style="width: 418px" class="wp-caption alignright"><img aria-describedby="caption-attachment-108" loading="lazy" data-attachment-id="108" data-permalink="https://floodyberry.wordpress.com/2009/10/07/high-performance-cplusplus-profiling/perf-2/" data-orig-file="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/perf1.png" data-orig-size="408,268" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="perf" data-image-description="" data-image-caption="" data-medium-file="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/perf1.png?w=300" data-large-file="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/perf1.png?w=408" class="size-full wp-image-108" title="Profiler Performance Comparison" src="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/perf1.png?w=408&#038;h=268" alt="perf" width="408" height="268" srcset="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/perf1.png 408w, https://floodyberry.wordpress.com/wp-content/uploads/2009/09/perf1.png?w=150&amp;h=99 150w, https://floodyberry.wordpress.com/wp-content/uploads/2009/09/perf1.png?w=300&amp;h=197 300w" sizes="(max-width: 408px) 100vw, 408px" /><p id="caption-attachment-108" class="wp-caption-text">Performance from various profilers on example-sphereflake.cpp. ¹-Glowcode was unable to get an accurate profile.</p></div>
<p>Non-instrumented run time for sphereflake is about <strong>3</strong> seconds on my E8400, and the functions I profile result in around <strong>10 million</strong> calls during that period. The breakdown of performance by profiler:</p>
<ul>
<li>My profiler is <strong>1.09x</strong> slower than baseline.</li>
<li>Sleepy (the sampling profiler) is inexplicably <strong>1.19x</strong> slower. Sleepy itself didn&#8217;t seem to use any CPU time, so I can only guess that something it did with the system was affecting context switches or whatnot.</li>
<li>Shiny, switched to use <em>rdtsc</em> as its timing mechanism, is <strong>1.27x</strong> slower.</li>
<li>GlowCode was unable to profile all of the functions needed regardless of how many optimizations I tried to turn off, hence the ¹. I eventually got it up to around <strong>4.5</strong> million calls, at which point it was slowing the program down by <strong>1.89x</strong>. Not too impressive for the &#8220;<strong><em>WORLD&#8217;S FASTEST PROFILER!&#8221;.</em></strong></li>
<li>Shiny, using <em>QueryPerformanceCounter</em>, is <strong>1.95x</strong> slower. This is why QPC is bad for high volume profiling: the latency adds up fast.</li>
<li>Visual Studio 2005&#8217;s &#8220;Profile Guided Optimization&#8221; (not pictured) not only ran<strong> 4.75x</strong> slower than baseline during profiling, but after it &#8220;optimized&#8221; the app from the profile data, the app ran slower!</li>
</ul>
<h4>Fin</h4>
<p>So that&#8217;s how to do high performance profiling! Even with Tribes generating <strong>480</strong>+ million calls during a <strong>2 ½</strong> minute timedemo of a <strong>25</strong> minute map, it still runs at <strong>300</strong>fps+ and accurately reports what is using time where. (No, it&#8217;s not OpenGL::setupBitmapTexCoords. <strong>BAD</strong> VTune!)</p>
]]></content:encoded>
					
					<wfw:commentRss>https://floodyberry.wordpress.com/2009/10/07/high-performance-cplusplus-profiling/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		
		<media:content url="https://0.gravatar.com/avatar/f0763395c4ecb8b1ff85e30f6d78323b7a7c85a2bfd2f6d6e218f947c55dc734?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">floodyberry</media:title>
		</media:content>

		<media:content url="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/timingmethods2.png" medium="image">
			<media:title type="html">Timing Methods</media:title>
		</media:content>

		<media:content url="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/calltree1.png" medium="image">
			<media:title type="html">Call Tree</media:title>
		</media:content>

		<media:content url="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-ascii-small.png" medium="image">
			<media:title type="html">ASCII Profiler Output</media:title>
		</media:content>

		<media:content url="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/profiler-html-small.png" medium="image">
			<media:title type="html">profiler-html-small</media:title>
		</media:content>

		<media:content url="https://floodyberry.wordpress.com/wp-content/uploads/2009/09/perf1.png" medium="image">
			<media:title type="html">Profiler Performance Comparison</media:title>
		</media:content>
	</item>
		<item>
		<title>Generating .DLL Wrappers</title>
		<link>https://floodyberry.wordpress.com/2008/09/08/generating-dll-wrappers/</link>
					<comments>https://floodyberry.wordpress.com/2008/09/08/generating-dll-wrappers/#comments</comments>
		
		<dc:creator><![CDATA[floodyberry]]></dc:creator>
		<pubDate>Mon, 08 Sep 2008 10:25:49 +0000</pubDate>
				<category><![CDATA[cplusplus]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Win32]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[dll]]></category>
		<guid isPermaLink="false">http://floodyberry.wordpress.com/?p=31</guid>

					<description><![CDATA[A while ago I came across Create your Proxy DLLs automatically by Michael Chourdakis. I thought it was a good idea, but had some room for improvement: Having to use an external .exe (dumpbin/tdump) was an unnecessary step, all the information you need is in the PE header! He did not handle wrapping mangled names [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>A while ago I came across <a href="http://www.codeproject.com/KB/DLL/CreateYourProxyDLLs.aspx">Create your Proxy DLLs automatically</a> by <a href="http://www.turboirc.com/main2/">Michael Chourdakis</a>. I thought it was a good idea, but had some room for improvement:</p>
<ul>
<li>Having to use an external .exe (dumpbin/tdump) was an unnecessary step; all the information you need is in the PE header!</li>
<li>He did not handle wrapping <a href="http://en.wikipedia.org/wiki/Name_mangling">mangled names</a> or forwarding forwards.</li>
<li>Generating an actual project instead of a command line compile call would be a lot more useful considering you will want to do some actual coding instead of generating an empty wrapper.</li>
<li>His coding style was somewhat awkward and not easy to modify.</li>
</ul>
<p>With this in mind, I set about writing my own version. <span id="more-31"></span> Dumping the export information from the PE header was the first step and relatively straightforward, as the PE header is not complicated. I did, however, need to steal GetPtrFromRVA and GetSectionFromRVA from <a href="http://www.wheaty.net/">Matt Pietrek</a>, because the RVAs (Relative Virtual Addresses) are only relative when the sections are properly mapped out in memory, not while in the PE container! Those two functions also happen to be used for game hacks when the cheat-maker wants to bypass any LoadLibrary &amp; EnumProcessModules hooks by manually mapping a .DLL into memory.</p>
<p>Handling export forwards is tricky if you don&#8217;t know what they are and simple once you find out.  Since there are no flags to indicate if an export is a forward, and an export name can contain any non-zero character, the system needs a way to tell if &#8220;NTDLL.RtlAllocHeap&#8221; is a forward or not. The linker solves this by pointing the exported function address at the name of the export in the export table, so you merely need to check if the exported function address is within the export table or not.</p>
<p>Proxying name mangling is unfortunately not as simple. Say a .DLL exports</p>
<pre class="brush: cpp; title: ; notranslate">void __stdcall SimpleExport( void ) {
}</pre>
<p>as a C++ function, resulting in the mangled export of &#8220;?SimpleExport@@YGXXZ&#8221;. Now say your proxy .DLL implements a stub and attempts to forward it (the @1 at the end of the export is telling the linker which ordinal to assign the export to):</p>
<pre class="brush: cpp; title: ; notranslate">.cpp file:

int __stdcall SimpleExport__YGXXZ() {
	...
}</pre>
<p>&nbsp;</p>
<pre class="brush: cpp; title: ; notranslate">.def file

?SimpleExport@@YGXXZ=SimpleExport__YGXXZ @1</pre>
<p>When you compile your .DLL, you&#8217;ll be surprised to find that it is exporting &#8220;SimpleExport&#8221;, not &#8220;?SimpleExport@@YGXXZ&#8221;! What is going on?! It turns out the Microsoft linker checks the target function for name mangling (an @ symbol), and if the target function name isn&#8217;t mangled, formats the exported name as a C function. The only way that I&#8217;ve found to export a mangled name is to <strong>point it at a mangled name</strong>:</p>
<pre class="brush: cpp; title: ; notranslate">.def file v2

?SimpleExport@@YGXXZ=?SimpleExport__YGXXZ@@YGXH@Z @1</pre>
<p>The good news is that this works great. The bad news is that if you alter your proxy function to match the source function, the name mangling changes and your .def file will need to be updated with the new name (not a trivial thing!). The &#8220;at least it works&#8221; solution I came up with is to have proxy functions for your proxy functions, i.e. functions whose name mangling will never change and which jmp to the real proxy function:</p>
<pre class="brush: cpp; title: ; notranslate">.cpp file:

int __stdcall SimpleExport__YGXXZ() {
	...
}

__declspec(naked) void __stdcall decorated1() {
	__asm {
		jmp SimpleExport__YGXXZ
	}
}
</pre>
<p>&nbsp;</p>
<pre class="brush: cpp; title: ; notranslate">.def file:

?SimpleExport@@YGXXZ=?decorated1@@YGXXZ @1</pre>
<p>Now you can alter the calling convention, parameters, and return type of the actual proxy function as much as you like and it won&#8217;t affect the linking.</p>
<p>One benefit of all the trouble mangled names cause is that, with the <a href="http://msdn.microsoft.com/en-us/library/ms681400(VS.85).aspx">UnDecorateSymbolName</a> DbgHelp API call, it is trivial to annotate the return type, calling convention, and parameters of the proxied function; e.g. you can now see that &#8220;?func1@a@@AAEXH@Z&#8221; actually means &#8220;private: void __thiscall a::func1(int)&#8221;.</p>
<h4>My Version</h4>
<p>There are a lot of ways to generate the wrapper and I thought Michael&#8217;s generated source also had room for improvement. For example, he doesn&#8217;t name any of the generated functions, and calling through to the original function requires a typedef (doesn&#8217;t work with intellisense), a non-obvious index into the imported function array, and a cast. Extending the code so any function can call through to any original function would require the typedef to be global (which still won&#8217;t work with intellisense).</p>
<p>The main points I wanted to hit were:</p>
<ol>
<li>Allow the original functions to be easily callable from anywhere</li>
<li>Require as few changes as possible if you want to convert a stub to a proxied function</li>
<li>Intellisense has to work!</li>
</ol>
<p>#1 meant creating dummy functions which jmp to their target proc. By creating them as:</p>
<pre class="brush: cpp; title: ; notranslate">
inline __declspec(naked) int call_AcceptEx() {
	__asm {
		jmp dword ptr [ mProcs + 0 * 4 ]
	}
}</pre>
<p>I&#8217;m not only able to stuff them in the header (keeping the source file clean), but naked means you can alter their parameters and return type as much as you like and don&#8217;t need to worry about how to use the FARPROC entry or which index to use.</p>
<p>To make one of the original functions globally callable, you only need to change the function signature in two places: The header stub, and the proxy function in the .cpp file. I would have loved to only have a single signature to change but wasn&#8217;t able to come up with a clean method. It&#8217;s &#8220;possible&#8221; with some god-awful macro magic, but I don&#8217;t even have a need for a wrapper at this point and the macros really obfuscate the code.</p>
<p>That is about it! More work could be done, but it&#8217;s decently functional now and I think I&#8217;ve been sitting on it for long enough. My version generates projects for both Visual Studio 2003 &amp; 2005 and attempts to load the original .DLL from the Windows system directory by default (fairly arbitrary, easy to change). There&#8217;s also a sample .DLL (dll.dll) with odd exports (a forwarded function, a mangled class function, a mangled C++ global function, and a __stdcall C function) so you can make sure any changes you make still compile fine.</p>
<p>Now I just need to find something to wrap with it&#8230;</p>
<h4>Downloads</h4>
<ul>
<li>GitHub repo: <a href="https://github.com/floodyberry/genwrapper">genwrapper</a></li>
<li>Example output for sample DLL: <a href="https://github.com/floodyberry/genwrapper/tree/master/dll">genwrapper/dll</a></li>
</ul>
]]></content:encoded>
					
					<wfw:commentRss>https://floodyberry.wordpress.com/2008/09/08/generating-dll-wrappers/feed/</wfw:commentRss>
			<slash:comments>9</slash:comments>
		
		
		
		<media:content url="https://0.gravatar.com/avatar/f0763395c4ecb8b1ff85e30f6d78323b7a7c85a2bfd2f6d6e218f947c55dc734?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">floodyberry</media:title>
		</media:content>
	</item>
		<item>
		<title>The id Tech 4 Script Compiler Sucks Hard!</title>
		<link>https://floodyberry.wordpress.com/2008/07/09/the-id-tech-4-script-compiler-sucks-hard/</link>
					<comments>https://floodyberry.wordpress.com/2008/07/09/the-id-tech-4-script-compiler-sucks-hard/#comments</comments>
		
		<dc:creator><![CDATA[floodyberry]]></dc:creator>
		<pubDate>Wed, 09 Jul 2008 20:23:05 +0000</pubDate>
				<category><![CDATA[Games]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[compiler]]></category>
		<category><![CDATA[etqw]]></category>
		<category><![CDATA[id]]></category>
		<guid isPermaLink="false">http://floodyberry.wordpress.com/?p=23</guid>

					<description><![CDATA[Whoever did most of the work on the id Tech 4 Script Compiler, I&#8217;m calling you out! I&#8217;ll grant that you managed to write a major component of a successful commercial engine, but&#8230; it&#8217;s just so bad. What confounds me even more is that all of the engines after DooM III did almost nothing (effective) [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Whoever did most of the work on the <a href="http://en.wikipedia.org/wiki/Id_Tech_4">id Tech 4</a> Script Compiler, I&#8217;m calling you out! I&#8217;ll grant that you managed to write a major component of a successful commercial engine, but&#8230; it&#8217;s just so bad. What confounds me even more is that all of the engines after DooM III did almost nothing (effective) to try and fix it: If you check out the Quake IV, Prey, and ET:QW SDKs, they all have the same basic compiler with a couple things bolted on. The ET:QW guys did do a bit of work on it and tried to speed it up a bit, but &#8220;glacially slow&#8221; doesn&#8217;t seem like much of an improvement on &#8220;geologically slow&#8221;.</p>
<p>I first noticed how bad it was when I was doing the ET:QW -&gt; Tribes stuff and started playing around with the scripts. Being so used to Tribes style scripting, two things hit me right off:</p>
<ol>
<li>You have to exit the mission and recompile every script if you want to update a script you just edited. Ok, pretty annoying, but I&#8217;m just playing around so it should get easier.</li>
<li>On my AMD64 3200+, recompiling was &#8220;damn slow&#8221; (<strong>~20</strong> seconds) in Release mode and &#8220;I&#8217;m going to read a book&#8221; (<strong>~60</strong> seconds) in Debug. Issue #1 just got a lot more annoying.</li>
</ol>
<p>How did they manage to develop on this for longer than 10 minutes before going crazy, let alone create an entire game? Apparently I was the first to get fed up enough about it, so I went to check out the compiler and see if I could find any hot spots.</p>
<p><span id="more-23"></span></p>
<h4>Putting Out Dead Fires</h4>
<p>What is most depressing about looking at the source is that you can tell someone <em>tried</em> to optimize it, but failed miserably.</p>
<ul>
<li>There are static lists for some elements (strings, functions, globals, statements, objects). <strong>idList</strong> grows linearly, not exponentially, so I assume someone noticed adding XXXX elements was really slow, knew it must be <strong>idList</strong> resizing itself far too many times, but didn&#8217;t know of an actual solution. A static list also means the code will break if you add too many items.</li>
<li>A Hash <strong>is</strong> used for <strong>idVarDef</strong>s (a catch all struct to hold scope/info for variables, functions, objects, and constants), but it is based on name only, i.e. every time it looks up a variable named &#8220;i&#8221;, it gets a linked list of <strong>every</strong> variable named &#8220;i&#8221; and has to check every one to see if they match the scope of the &#8220;i&#8221; you want.
<p>Even better, the list of constants is kept in the <strong>idVarDef</strong> list named &#8220;&lt;Immediate&gt;&#8221;, which means a lookup to see if a constant value exists requires that you iterate over every string constant, numeric constant, stack return constant, etc.</p></li>
<li>Splash Damage did add a Hash for the <strong>idTypeDef</strong> list in ET:QW, but <strong>only</strong> for the user defined types i.e. objects. The default types were still checked for with a cascaded if statement.
<p>They also added blockAllocators for a couple elements, but allocations weren&#8217;t the main problem.</p></li>
</ul>
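<p>To see concretely why linear growth hurts, compare reallocation counts (a standalone sketch, not id&#8217;s actual <strong>idList</strong>):</p>

```cpp
#include <cstddef>

// Reallocations needed to append n elements one at a time.
// Linear growth: add a fixed granularity every time the list fills.
size_t reallocsLinear(size_t n, size_t granularity) {
    size_t capacity = 0, count = 0;
    while (capacity < n) { capacity += granularity; ++count; }
    return count;
}

// Exponential growth: double the capacity every time the list fills.
size_t reallocsExponential(size_t n) {
    size_t capacity = 1, count = 1;
    while (capacity < n) { capacity *= 2; ++count; }
    return count;
}
```

<p>Appending 100,000 elements with a granularity of 16 costs 6,250 reallocations (each one copying the whole list so far), versus 18 with doubling.</p>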
<h4>Active Volcanos</h4>
<p>So what were the (major) problems?</p>
<p><em>Script_Compiler.cpp</em>:</p>
<ul>
<li><strong>idCompiler::FindImmediate</strong> lumped all constants under a single name (&lt;Immediate&gt;), leading to a linear search every time one was looked up</li>
<li><strong>idCompiler::CheckType</strong> checked for default types (string, boolean, virtual, etc) with a cascaded if statement.</li>
<li><strong>idCompiler::GetExpression</strong> did a linear search through the opcode list</li>
<li><strong>idCompiler::ParseReturnStatement</strong> did a linear search through the opcode list to find the &#8220;=&#8221; opcode</li>
</ul>
<p><em>Script_Program.cpp</em></p>
<ul>
<li><strong>idVarDef</strong> was stored in a hash by name only. Each <strong>idVarDef</strong> is a unique pair of name &amp; scope, so the entire list of <strong>idVarDef</strong>s would have to be searched for the proper scope member</li>
<li>The <strong>idVarDef</strong>s were also stored in a weird linked list thing that required constant maintenance</li>
<li><strong>idProgram::FindType</strong> used a linear search even when the hash table was available</li>
<li><strong>idProgram::MatchesVirtualFunction</strong> used a linear search through the virtual functions</li>
</ul>
<p>The generous explanation is that nobody got around to storing things in hash tables properly because the <strong>idHashTable</strong> implementations suck. They&#8217;re hardcoded to be key&#8217;d on a string so you can&#8217;t construct custom keys (such as an object containing a name and a scope for <strong>idVarDef</strong>s).</p>
<p>What is especially ironic (maybe only to me) is that Carmack ran into performance problems <a href="http://www.team5150.com/~andrew/carmack/johnc_plan_1996.html#d19960903">due to linear searches</a> before with qcc! I have no idea if he contributed at all to the compiler, but it still tickles me.</p>
<h4>Postmature Optimization</h4>
<p>The main fix for this type of problem should be obvious: more hash tables! It took me around a week and a half or so to figure out what exactly was going on, identify the hotspots, and fix them. This involved lookups for the compiler default types, the opcode table, and the global virtuals table, as well as creating an <strong>idScopeName</strong> object to key the list of <strong>idVarDef</strong>s on.</p>
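<p>A sketch of what keying on the (name, scope) pair can look like, using standard containers in place of <strong>sdHashMapGeneric</strong> (the field layout here is my guess, not the SDK&#8217;s actual <strong>idScopeName</strong>): a lookup for &#8220;i&#8221; in one scope never has to walk the &#8220;i&#8221;s from every other scope.</p>

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// A variable is unique per (name, scope), not per name, so make that the key.
struct idScopeName {
    std::string name;
    int scope;  // stand-in for the idVarDef scope pointer/id
    bool operator==(const idScopeName &other) const {
        return scope == other.scope && name == other.name;
    }
};

struct idScopeNameHash {
    std::size_t operator()(const idScopeName &key) const {
        // combine both hashes: one bucket chain per (name, scope)
        return std::hash<std::string>()(key.name) ^ (std::hash<int>()(key.scope) * 2654435761u);
    }
};

using VarDefMap = std::unordered_map<idScopeName, int, idScopeNameHash>;
```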
<p>To cope with the bloat the hash tables might introduce, I also used an <strong>idStrPool</strong> for all the strings inside <em>idProgram.cpp</em>. This had the added bonus of allowing pointer hashing and comparisons on the <strong>idPoolStr*</strong> since they are guaranteed to be unique.</p>
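<p>The pooling trick is plain string interning: every distinct string is stored exactly once, so two pooled pointers are equal if and only if the strings are equal, turning hashing and comparison into pointer operations. A minimal sketch with <strong>std::unordered_set</strong> standing in for <strong>idStrPool</strong>:</p>

```cpp
#include <string>
#include <unordered_set>

// Return the canonical copy of a string; identical strings always come back
// as the same pointer, so pointer == pointer replaces strcmp. Element
// addresses in an unordered_set stay valid across rehashes, so the returned
// pointers are stable.
const std::string *Intern(std::unordered_set<std::string> &pool, const std::string &s) {
    return &*pool.insert(s).first;
}
```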
<p>If you remember that I said the <strong>idHashTable</strong> implementations were awful, they are! Luckily the Splash Damage guys created <strong>sdHashMapGeneric</strong> so I didn&#8217;t have to make my own or figure out what exactly the weird id stuff is doing (I still don&#8217;t understand it).</p>
<p>There were some structural changes that had to be made to accommodate all of this, but nothing earth shattering. I chopped out some useless classes and created new ones such as <strong>idTypeDef_Static</strong> (<strong>idTypeDef</strong>s need an <strong>idPoolStr*</strong> name now, but you can&#8217;t statically create one), etc.</p>
<h4>Did It Get Faster?</h4>
<p>Yes! I don&#8217;t really have a wide assortment of test systems, but these are the improvements I saw:</p>
<pre class="brush: cpp; title: ; notranslate">                           Default    Optimized     Speedup
FeaRog&#039;s Laptop(Debug):  ~180-300s         ~12s      15-25x
AMD64 3200+(Release):         ~20s          ~1s         20x
AMD64 3200+(Debug):           ~60s          ~4s         15x
E8400@3.6ghz(Release):       ~1.5s        ~0.3s          5x
E8400@3.6ghz(Debug):        ~32.5s        ~2.3s         14x</pre>
<p>Memory usage did go up by ~0.5mb, but that&#8217;s really nothing when the compiler is using 6-7mb as it is. I also spiffed up <strong>idProgram::CompileStats</strong> to produce more detailed stats so you can see exactly where the memory is going (and the overhead on the various containers). I would have fixed <strong>idList</strong> to resize exponentially and gotten rid of the awful <strong>.setGranularity</strong> calls, but editing anything in <strong>idLib</strong> always forced nearly the entire solution to recompile so I let it go.</p>
<h4>Fin</h4>
<p><a href="http://www.team5150.com/~andrew/blog/files/etqwcompiler-1.5.zip">Download the source</a> (for ET:QW SDK 1.5) and try it out! Simply unzip it to your ET:QW SDK source directory and re-compile, it should work flawlessly unless you&#8217;ve made conflicting modifications to the 1.5 SDK.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://floodyberry.wordpress.com/2008/07/09/the-id-tech-4-script-compiler-sucks-hard/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		
		<media:content url="https://0.gravatar.com/avatar/f0763395c4ecb8b1ff85e30f6d78323b7a7c85a2bfd2f6d6e218f947c55dc734?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">floodyberry</media:title>
		</media:content>
	</item>
		<item>
		<title>Tribes 1 Physics, Part Four: Explosions</title>
		<link>https://floodyberry.wordpress.com/2008/04/29/tribes-1-physics-part-four-explosions/</link>
					<comments>https://floodyberry.wordpress.com/2008/04/29/tribes-1-physics-part-four-explosions/#comments</comments>
		
		<dc:creator><![CDATA[floodyberry]]></dc:creator>
		<pubDate>Tue, 29 Apr 2008 17:30:23 +0000</pubDate>
				<category><![CDATA[Games]]></category>
		<category><![CDATA[Tribes]]></category>
		<category><![CDATA[tribes physics]]></category>
		<guid isPermaLink="false">http://floodyberry.wordpress.com/?p=21</guid>

					<description><![CDATA[Explosions (or more accurately, knockback), the final piece to the physics puzzle! If you implement all of the previous articles, make a disc launcher, plug in the apparently simple knockback force and radius from baseProjData.cs, and finally attempt to disc jump, you will be greeted with&#8230; a nice and wimpy boost. Playing around with the [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Explosions (or more accurately, knockback), the final piece to the physics puzzle! If you implement all of the previous articles, make a disc launcher, plug in the apparently simple knockback force and radius from baseProjData.cs, and finally attempt to disc jump, you will be greeted with&#8230; a nice and wimpy boost. Playing around with the knockback force will only break explosions in new ways. Argh, you were so close! How could Tribes possibly screw this one up? <span id="more-21"></span></p>
<h4>Basic Knockback</h4>
<p>The basic idea for knockbacks from explosions is you want to take the position of the explosion and the object, scale the explosion force, and apply an impulse to the object in the appropriate direction. Generally this is implemented something like so:</p>
<pre class="brush: cpp; title: ; notranslate">Player.onExplosion( Vector3 explosion, float radius, float knockback, float damage ) {
    Vector3 distance = ( hitbox.center - explosion ), direction = distance.Normalize
    if ( distance.Length &gt; radius )
        return

    float power = 1 - ( distance.Length / radius )
    ApplyDamage( power * damage )
    ApplyImpulse( knockback * ( power * direction ) / armor.Mass )   
}
</pre>
<p><em>power</em> is where the magic happens and while it should be an exponential falloff, it is common to do linear because there is not much of a noticeable difference. It is also where Tribes deviates from the norm in a fairly big way.</p>
<h4>Tribes Knockback</h4>
<p>What is obvious when you implement the basic knockback is that it packs a much smaller punch than Tribes. What you might not notice is that the Tribes knockback up close, e.g. at your feet from a standstill, is actually weaker than from a bit of a distance, e.g. at your feet while you&#8217;re in the air after jumping. Whether this was intended as an aid to disc jumping or an accident of a botched formula is impossible to say, but it doubtless would have been harder to gain speed from disc jumps without this odd feature.</p>
<p>If you work out exactly what the Tribes knockback function does (not fun) and refactor the formula (not fun and also confusing), you come out with a somewhat familiar power calculation. Instead of the basic ( 1 - ( d / r ) ), Tribes uses ( ( d / Max( d - minHitboxDimension, 1 ) ) - ( d / r ) ), i.e. instead of using a constant 100% minus a linear falloff for the power, Tribes does some weird calculations which I can only guess are trying to account for the hitbox in some way. This image illustrates the power falloff for Tribes, the power falloff for the basic formula, and the power falloff for the basic formula if it were boosted so players were able to disc jump properly:</p>
<p><img src="https://i0.wp.com/www.team5150.com/~andrew/blog/images/t1knockback.png" /></p>
<p>This shows why the linear falloff formula cannot be fixed by boosting the knockback force without getting wildly inaccurate up close. Here is the proper formula:</p>
<pre class="brush: cpp; title: ; notranslate">Player.onExplosion( Vector3 explosion, float radius, float knockback, float damage ) {
    Vector3 distance = ( hitbox.Center - explosion ), direction = distance.Normalize
    float minbox = Min( Min( hitbox.Width.x, hitbox.Width.y ), hitbox.Width.z )
    float d = Max( distance.Length - minbox, MetersToUnits( 1 ) )
    if ( d &gt; radius )
        return

    float power = ( distance.Length / d ) - ( distance.Length / radius )
    ApplyDamage( damage * ( 1 - d / radius ) )
    ApplyImpulse( knockback * ( power * direction ) / armor.Mass )   
}
</pre>
<p>For the Light Armor with a BOX_WIDTH of 0.5, BOX_DEPTH of 0.5, and BOX_HEIGHT of 2.3, minbox equates to Min( Min( BOX_WIDTH * 2, BOX_DEPTH * 2 ), BOX_HEIGHT ), or 1.0. The player origin is at the center of the player&#8217;s feet and the hitbox dimensions extend out from that, so the width and depth extend out in either direction and need to be doubled while the height goes from the feet to the head. Note that if you increase the smallest hitbox dimension, e.g. increase BOX_WIDTH and BOX_DEPTH to have a value of 2.3 when doubled, the linear ramp leading up to the downward falloff of the power will grow, presumably to account for greater surface contact area with the explosion.</p>
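<p>The two power curves boil down to this (my own condensation of the formulas above, distances in meters, minbox = 1.0 for the Light Armor):</p>

```cpp
#include <algorithm>

// Basic linear falloff: 100% at the explosion, 0% at the radius.
float basicPower(float dist, float radius) {
    return 1.0f - (dist / radius);
}

// Tribes' falloff: the Max() clamp makes the power *rise* with distance near
// the player before it starts falling off, which is why a disc at your feet
// from a standstill is weaker than one that detonates a little farther away.
float tribesPower(float dist, float radius, float minbox) {
    float d = std::max(dist - minbox, 1.0f);
    return (dist / d) - (dist / radius);
}
```

<p>For a radius of 20 and minbox of 1.0, a disc at 0.5 gives a power of 0.475 while one at 2.0 gives 1.9: four times stronger from farther away, and well past the basic formula&#8217;s maximum of 1.0.</p>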
<h4>You&#8217;re Done!</h4>
<p>You should have all you need to get authentic Tribes 1 Physics going in your engine of choice! After this you should only need to create some dummy weapons, scale their velocities to whatever units you are using, add two flags, and you can get an honest game of LT going. </p>
<p>What? You don&#8217;t believe me that it works and you don&#8217;t want to waste your time implementing the physics only to find out that I lied? Hmmm, here&#8217;s something that should convince you. This is the full T1 physics running on Raindance in ET:QW. I did this for the ET:QW Tribes mod that was in the works, but unfortunately things kind of fizzled out. The map is authentic Raindance (every crevasse and triangle is there) with an 8192&#215;8192 command map texture overlaid on top with Arcanox&#8217;s bases and bridge. This is currently the closest thing to Tribes you will ever see in another engine.</p>
<iframe class="youtube-player" width="640" height="360" src="https://www.youtube.com/embed/zOcwJX-nNok?version=3&#038;rel=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;fs=1&#038;hl=en&#038;autohide=2&#038;wmode=transparent" allowfullscreen="true" style="border:0;" sandbox="allow-scripts allow-same-origin allow-popups allow-presentation allow-popups-to-escape-sandbox"></iframe>
<p>You can also download the <a href="http://www.team5150.com/~andrew/blog/files/etqw-raindance-discjump.avi">hi-res version (15mb)</a>, although the burps are more apparent since my computer wasn&#8217;t quite up to spec to run QW. How I got the terrain in and textured will have to wait for another day.</p>
<h4>Tribes 1 Physics Series</h4>
<ul>
<li>Part One: <a href="https://floodyberry.wordpress.com/2008/02/20/tribes-1-physics-part-one-overview/">Overview</a></li>
<li>Part Two: <a href="https://floodyberry.wordpress.com/2008/02/24/tribes-1-physics-part-two-movement/">Movement</a></li>
<li>Part Three: <a href="https://floodyberry.wordpress.com/2008/04/11/tribes-1-physics-part-three-collision/">Collision</a></li>
<li>Part Four: <a href="https://floodyberry.wordpress.com/2008/04/29/tribes-1-physics-part-four-explosions/">Explosions</a></li>
</ul>
]]></content:encoded>
					
					<wfw:commentRss>https://floodyberry.wordpress.com/2008/04/29/tribes-1-physics-part-four-explosions/feed/</wfw:commentRss>
			<slash:comments>6</slash:comments>
		
		<enclosure url="http://www.team5150.com/~andrew/blog/files/etqw-raindance-discjump.avi" length="15779840" type="video/x-msvideo" />

		
		<media:content url="https://0.gravatar.com/avatar/f0763395c4ecb8b1ff85e30f6d78323b7a7c85a2bfd2f6d6e218f947c55dc734?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">floodyberry</media:title>
		</media:content>

		<media:content url="http://www.team5150.com/~andrew/blog/images/t1knockback.png" medium="image" />
	</item>
		<item>
		<title>Tribes 1 Physics, Part Three: Collision</title>
		<link>https://floodyberry.wordpress.com/2008/04/11/tribes-1-physics-part-three-collision/</link>
					<comments>https://floodyberry.wordpress.com/2008/04/11/tribes-1-physics-part-three-collision/#respond</comments>
		
		<dc:creator><![CDATA[floodyberry]]></dc:creator>
		<pubDate>Sat, 12 Apr 2008 00:47:59 +0000</pubDate>
				<category><![CDATA[Games]]></category>
		<category><![CDATA[Tribes]]></category>
		<category><![CDATA[tribes physics]]></category>
		<guid isPermaLink="false">http://floodyberry.wordpress.com/?p=20</guid>

					<description><![CDATA[This article covers the collision physics of Tribes 1, i.e. attempting to actually move the player and what to do when the player runs in to something. This is the most convoluted part of the physics and requires a lot of little touches to get right. Tribes movement and collision handling actually gets a little [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>This article covers the collision physics of Tribes 1, i.e. attempting to actually move the player and what to do when the player runs in to something. This is the most convoluted part of the physics and requires a lot of little touches to get right. Tribes movement and collision handling actually gets a little too low level, so I won&#8217;t be able to show <em>exactly</em> how it works (it gets down to dealing with the raw triangle lists which I don&#8217;t think all collision detection systems will let you get at), but it will be detailed enough so that there will be no major differences. <span id="more-20"></span></p>
<h4>Warning!</h4>
<p>Before I get into the details, a little explanation is required. When Tribes attempts to move an object, it takes the maximum distance the object will cover in the remaining time (X), then divides the remaining time by ceil(X) and does ceil(X) collision detection loops. This is ostensibly to ensure that an object only moves 1m at a time, but for what reason, I don&#8217;t know. The engine is obviously capable of fairly arbitrary translations (or else GameBase::GetLosInfo would not work) so I can only imagine many short translations proved to be more efficient than a single large translation.</p>
<p>I am going to be using a single translation, and normally this would not make a difference, but Tribes does something a little weird on collisions which necessitates a rather odd fix-up when you are using a single translation versus slicing them up. Instead of adjusting the position based on velocity when there is no collision and adjusting the velocity &amp; setting the position to the collision point on a collision, Tribes adjusts the position based on the velocity <em>no matter what</em>. This means Tribes will attempt to move the player, hit a surface, adjust the velocity based on the collision, and then move the player anyway <em>from their original position</em> based on the new velocity, usually resulting in the player bouncing ever so slightly away from the contacted surface. There is actually a noticeable difference between the correct way and Tribes way of handling collisions, as the correct way feels slightly velcroish while Tribes feels more fluid. </p>
<p>Things get more complicated when you move the player in a single translation instead of sliced up translations as the weird adjustment on collisions is only done for the last fraction of the translation and not the entire thing. I worked around this by calculating some values which let me figure out how many &#8220;non-collision&#8221; slices have occurred and only do the weird handling on the last slice.</p>
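<p>Pulled out on its own, the slicing that Tribes does (and that the fix-up has to reproduce) is just this (a standalone restatement, not engine code):</p>

```cpp
#include <algorithm>
#include <cmath>

// Tribes caps each collision step at roughly one meter of travel:
// speedMeters is the object's speed in meters/sec, timeLeft in seconds.
int SliceCount(float speedMeters, float timeLeft) {
    return std::max(1, (int)std::ceil(speedMeters * timeLeft));
}

float SliceTime(float speedMeters, float timeLeft) {
    return timeLeft / SliceCount(speedMeters, timeLeft);
}
```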
<h4>Collision Code</h4>
<pre class="brush: cpp; title: ; notranslate">Player.UpdatePosition( float tickLen ) {
    float decayFriction = currentFriction * Physics.FRICTIONDECAY
    float lastSurfDirection = lastJumpableNormal.Dot( Gravity.upNormal )
    float timeLeft = tickLen
    int maxBumps = 4, bumps
    
    currentFriction = 0
    collisionLastFrame = false
    
    for ( bumps = 0; ( bumps &lt; maxBumps ) &amp;&amp; ( timeLeft &gt; 0 ); bumps++ ) {
        // slice fixup values
        Vector3 maxDistance = ( velocity * timeLeft )
        int iterations = ceil( UnitsToMeters( maxDistance.Length ) )
        float sliceTime = timeLeft / iterations
        
        // attempt to move through the world
        Vector3 originalPosition = position, endPos = position + maxDistance
        moveFraction, finalPos, contactNormal = Physics.Translate( originalPosition, endPos )

        // figure out how long we moved for and adjust the remaining time
        float duration = timeLeft * moveFraction
        timeLeft -= duration
        position = finalPos

        // did we move the entire distance safely?
        if ( !timeLeft )
            break
        
        // collisionLastFrame gets set even if we step up and don&#039;t have an actual collision
        collisionLastFrame = true
        float surfDirection = contactNormal.Dot( Gravity.upNormal )

        if ( surfDirection &lt; armor.JUMPSURFACE_MINDOT ) {
            // code to handle potentially stepping up sheer surfaces
            if ( steppedUp )
                continue
        }
        
        float impactDot = -velocity.Dot( contactNormal )
            
        // take damage if needed
        if ( UnitsToMeters( impactDot ) &gt; armor.MINDAMAGESPEED )
            OnDamage( ( UnitsToMeters( impactDot ) - armor.MINDAMAGESPEED ) * armor.DAMAGESCALE )

        // if we hit a jumpable surface, update the jumpable normal and reset the timestamp
        if ( surfDirection &gt;= armor.JUMPSURFACE_MINDOT ) {
            if ( ( lastJumpableNormalTimestamp &gt; ( Physics.TICKBASE * 1000 ) ) ||
                 ( surfDirection &lt; lastSurfDirection ) ) {
                lastSurfDirection = surfDirection
                lastJumpableNormalTimestamp = 0
                lastJumpableNormal = contactNormal
            }
        }
        
        // do some voodoo for tribes collision adjustments and timeslices
        int impactIterations = ceil( duration / sliceTime )
        float fullMotionTime = sliceTime * ( impactIterations - 1 ) 
        float fixupTime = duration - fullMotionTime
        position = originalPosition + ( velocity * fullMotionTime )

        // bounce
        Vector3 bounce = ( contactNormal * ( impactDot + MetersToUnits( Physics.ELASTICITY ) ) )
        velocity += bounce

        // readjust position based on bounced velocity
        position = Physics.Translate( position, position + ( velocity * fixupTime ) )

        // only update friction on upward facing surfaces
        if ( surfDirection &gt; 0 ) {
            currentFriction = surfDirection

            if ( crawledToStop &amp;&amp; ( velocity.Length &lt; MetersToUnits( Physics.MINSPEED ) ) ) {
                velocity = Vector3( 0, 0, 0 )
                position = originalPosition
                break
            }
        }
    }

    if ( bumps &gt;= maxBumps ) {
        // Tribes sets the velocity to 0 here, this is where skibugs happen
    }
    
    if ( collisionLastFrame ) 
        currentFriction = Min( Max( currentFriction, decayFriction ), 1 )
        
    return ( collisionLastFrame )
}</pre>
<h4>Notes</h4>
<p>Ski bugs are caused when the translation loop exceeds the maximum number of collisions. When this happens, Tribes zeros the player&#8217;s velocity as it is having trouble successfully moving. I think there is something in the velocity <em>bounce</em> that occasionally causes the translation loop to get stuck running into the surface over and over with no change in velocity. When this happens, there is obviously no right answer as to what to do, but zeroing the velocity is a fairly annoying answer. I&#8217;ve found that just ignoring the situation and hoping the next tick results in the player getting dislodged appears to work much better.</p>
<p>Also note that I keep forgetting to add constants (MINDAMAGESPEED and DAMAGESCALE in this post) and need to update the original post to include them. Since nobody is reading this and I am taking a while to get it together, I do not think anyone will mind.</p>
<h4>Tribes 1 Physics Series</h4>
<ul>
<li>Part One: <a href="https://floodyberry.wordpress.com/2008/02/20/tribes-1-physics-part-one-overview/">Overview</a></li>
<li>Part Two: <a href="https://floodyberry.wordpress.com/2008/02/24/tribes-1-physics-part-two-movement/">Movement</a></li>
<li>Part Three: <a href="https://floodyberry.wordpress.com/2008/04/11/tribes-1-physics-part-three-collision/">Collision</a></li>
<li>Part Four: <a href="https://floodyberry.wordpress.com/2008/04/29/tribes-1-physics-part-four-explosions/">Explosions</a></li>
</ul>
]]></content:encoded>
					
					<wfw:commentRss>https://floodyberry.wordpress.com/2008/04/11/tribes-1-physics-part-three-collision/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		
		<media:content url="https://0.gravatar.com/avatar/f0763395c4ecb8b1ff85e30f6d78323b7a7c85a2bfd2f6d6e218f947c55dc734?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">floodyberry</media:title>
		</media:content>
	</item>
		<item>
		<title>Tribes 1 Physics, Part Two: Movement</title>
		<link>https://floodyberry.wordpress.com/2008/02/24/tribes-1-physics-part-two-movement/</link>
					<comments>https://floodyberry.wordpress.com/2008/02/24/tribes-1-physics-part-two-movement/#respond</comments>
		
		<dc:creator><![CDATA[floodyberry]]></dc:creator>
		<pubDate>Sun, 24 Feb 2008 16:48:55 +0000</pubDate>
				<category><![CDATA[Games]]></category>
		<category><![CDATA[Tribes]]></category>
		<category><![CDATA[tribes physics]]></category>
		<guid isPermaLink="false">http://floodyberry.wordpress.com/?p=19</guid>

					<description><![CDATA[This article covers the movement physics of Tribes 1, i.e. jumping, jetting, and ground movement/friction. These account for ~90% of the &#8220;Tribes&#8221; feeling, although even 90% will still feel wrong to anyone who knows the authentic feel well. There will be some variables which are only set in the Collision code, but there shouldn&#8217;t be [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>This article covers the movement physics of Tribes 1, i.e. jumping, jetting, and ground movement/friction. These account for ~90% of the &#8220;Tribes&#8221; feeling, although even 90% will still feel wrong to anyone who knows the authentic feel well. There will be some variables which are only set in the Collision code, but there shouldn&#8217;t be any confusion as to what they do. <span id="more-19"></span></p>
<h4>Movement Code</h4>
<p>There are a few spots where Tribes took shortcuts with its math and assumed that gravity would always be &#8220;0 0 -X&#8221;, saving a couple multiplies on dots and such. I&#8217;ve replaced these spots to be gravity agnostic as I don&#8217;t think the CPU savings are that big and they make the code difficult to understand. It shouldn&#8217;t be hard to &#8220;re-optimize&#8221; if you feel the need. Dot products will be a .Dot method because overloaded multiplication operators confuse me.</p>
<h5>Player.Tick</h5>
<p>Player.Tick starts off by taking the user&#8217;s movement input (left, right, forward, back) and creating the speed vector for walking and direction vector for jumping and jetting. This is pretty basic and there is nothing out of the ordinary here.</p>
<pre class="brush: cpp; title: ; notranslate">Player.Tick( Move move, int tickLenMs ) {
    float tickLen = ( 1000.0f / tickLenMs )    // ticks per second, e.g. 31.25 for 32ms ticks

    float maxSpeed = MetersToUnits( armor.WALKSPEED )
    float forwardSpeed = MetersToUnits( armor.WALKSPEED ), sideSpeed = MetersToUnits( armor.WALKSPEED - 1 )

    move.speed = Vector3( ( move.right - move.left ) * sideSpeed, ( move.forward - move.back ) * forwardSpeed, 0 )
    if ( move.speed.Length &gt; maxSpeed )
        move.speed *= ( maxSpeed / move.speed.Length )

    move.speed = self.ToWorldSpace( move.speed )
    move.direction = move.speed.Normalize
}
</pre>
<p>After this, the Player energy is updated, we check lastJumpableNormalTimestamp to see if a jump should be allowed, and then do the jumping, jetting, and walking. Most of the real work takes place in Jump, Jet, and Friction, so the tick function is fairly sparse.</p>
<p>A few things to note:</p>
<ul>
<li>Tribes lets you jump up to 256ms after your last ground contact, allowing you to jump smoothly through little bumps and such where you technically leave the ground, but don&#8217;t really appear to.</li>
<li>Tribes handles jetpack energy a little differently, i.e. once you get to around 5% of your jets, they cut off and recharge a little causing the jets to stutter. I haven&#8217;t worked this part out yet, but may re-edit this section if I do.</li>
<li>Tribes applies gravity <em>constantly</em>, it is not hacked off when you are resting on a surface.</li>
<li>JETENERGY_CHARGE equals &#8220;8 + 3&#8221; with the &#8220;+ 3&#8221; being the recharge boost from an energy pack.</li>
</ul>
<pre class="brush: cpp; title: ; notranslate">    bool isJetting = ( move.jetting &amp;&amp; ( energy &gt; 0 ) )
    bool isJumping = ( move.jumping &amp;&amp; ( lastJumpableNormalTimestamp &lt; Physics.MAXJUMPTICKS ) )

    // jump 
    if ( isJumping )
        velocity += Jump( move.direction )
    crawlToStop = ( velocity.Length &lt; MetersToUnits( Physics.CRAWLTOSTOP ) )

    // jets and acceleration
    accel = Gravity.force
    energy += ( armor.JETENERGY_CHARGE * tickLen )
    if ( isJetting ) {
        accel += Jet( move.direction )
        energy -= ( armor.JETENERGY_DRAIN * tickLen )
    }
    energy.Clamp( 0, armor.MAXENERGY )
    velocity += ( accel / tickLen )

    // walking and friction
    if ( collisionLastTick )
        velocity += Friction( move.speed, tickLen )

    // update jumpableNormal timestamp and try to move
    lastJumpableNormalTimestamp += tickLenMs
    /* position, velocity = UpdatePosition( tickLen ) */
}  

</pre>
<h5>Player.Jump</h5>
<pre class="brush: cpp; title: ; notranslate">Player.Jump( Vector3 moveDirection ) {
    // need another ground contact before we can jump again
    lastJumpableNormalTimestamp = Physics.MAXJUMPTICKS

    // jump up
    float surfaceDirection = lastJumpableNormal.Dot( Gravity.upNormal ) 
    float impulse = MetersToUnits( armor.JUMPIMPULSE / armor.MASS )
    Vector3 jump = ( surfaceDirection * impulse ) * Gravity.upNormal 

    // if we&#039;re moving away from the surface, jump away
    float orientation = lastJumpableNormal.Dot( moveDirection )
    if ( orientation &gt; 0 ) 
        jump += ( impulse * orientation ) * moveDirection

    return ( jump )
}
</pre>
<h5>Player.Jet</h5>
<p>Side jets only kick in if the player is holding down a movement key and it&#8217;s been longer than MAXJUMPTICKS since the last ground contact. Since jumping sets lastJumpableNormalTimestamp to the limit, jumping and jetting results in side jets being enabled immediately, while simply holding down your jets on the ground will give a little startup time of full jets regardless of whether a direction key is down.</p>
<pre class="brush: cpp; title: ; notranslate">Player.Jet( Vector3 moveDirection ) {
    float forwardVelocity = MetersToUnits( armor.JETFORWARD )
    float jetForce = MetersToUnits( armor.JETFORCE / armor.MASS )

    if ( ( lastJumpableNormalTimestamp &gt;= Physics.MAXJUMPTICKS ) &amp;&amp; ( moveDirection.Length != 0 ) ) {
        float sidePower
        float orientation = velocity.Dot( moveDirection )
        if ( orientation &gt; forwardVelocity )
            sidePower = 0
        else if ( orientation &lt; 0 )
            sidePower = armor.JETSIDEFORCE
        else
            sidePower = ( 1 - ( orientation / forwardVelocity ) )

        sidePower = Min( sidePower, armor.JETSIDEFORCE )
        Vector3 sideForce = ( sidePower * jetForce ) * moveDirection
        Vector3 upForce = ( ( 1 - sidePower ) * jetForce ) * Gravity.upNormal 
        return ( upForce + sideForce )       
    } else {
        // straight up, full jets
        return ( jetForce * Gravity.upNormal ) 
    }
}
</pre>
<h5>Vector.ProjectOntoPlane</h5>
<p>This is needed for the gravity-agnostic Friction function. I don&#8217;t know how common it is.</p>
<pre class="brush: cpp; title: ; notranslate">Vector3.ProjectOntoPlane( Vector3 normal ) {
    this -= ( this.Dot( normal ) * normal )
}
</pre>
<h5>Player.Friction</h5>
<p>Multiplying GROUNDTRACTION by currentFriction is unfortunately not the magical &#8220;Friction = 0&#8221; that supposedly causes skiing. currentFriction is decayed every tick when there hasn&#8217;t been a ground contact, resulting in the ability to jet and slide against walls and ceilings without being slowed down by the contact friction.</p>
<p>ProjectOntoPlane is probably not needed since the player should always be oriented so the move will never include a component not in the gravity plane, but it doesn&#8217;t hurt to include it.</p>
<pre class="brush: cpp; title: ; notranslate">Player.Friction( Vector3 moveSpeed, float tickLen ) {
    Vector3 dampen = ( moveSpeed - velocity )
    dampen.ProjectOntoPlane( Gravity.downNormal )

    float traction = Min( currentFriction * armor.GROUNDTRACTION, 1 )
    float force = ( MetersToUnits( armor.GROUNDFORCE / armor.MASS ) * traction * tickLen )
    if ( dampen.Length &gt; force )
        dampen *= ( force / dampen.Length )
    else
        crawlToStop = true

    return ( dampen )
}
</pre>
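<p>The decay of currentFriction isn&#8217;t shown in the code above (the update happens in the Collision code). As a rough sketch of the behavior only, assuming FRICTIONDECAY (0.6, from the Physics constants in Part One) simply multiplies currentFriction on each tick without a ground contact:</p>

```cpp
#include <cassert>

// Sketch only: geometric decay of contact friction while airborne or
// sliding along non-ground surfaces. The real update site is in the
// Collision code; the 0.6 factor is FRICTIONDECAY from Part One.
const float FRICTIONDECAY = 0.6f;

float DecayFriction( float currentFriction, int ticksWithoutGroundContact ) {
    for ( int i = 0; i < ticksWithoutGroundContact; i++ )
        currentFriction *= FRICTIONDECAY;
    return currentFriction;
}
```

<p>After ~8 ticks (about a quarter of a second) the contact friction is under 2% of its ground value, which is why jetting along walls and ceilings barely slows you down.</p>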
<h4>Tribes 1 Physics Series</h4>
<ul>
<li>Part One: <a href="https://floodyberry.wordpress.com/2008/02/20/tribes-1-physics-part-one-overview/">Overview</a></li>
<li>Part Two: <a href="https://floodyberry.wordpress.com/2008/02/24/tribes-1-physics-part-two-movement/">Movement</a></li>
<li>Part Three: <a href="https://floodyberry.wordpress.com/2008/04/11/tribes-1-physics-part-three-collision/">Collision</a></li>
<li>Part Four: <a href="https://floodyberry.wordpress.com/2008/04/29/tribes-1-physics-part-four-explosions/">Explosions</a></li>
</ul>
]]></content:encoded>
					
					<wfw:commentRss>https://floodyberry.wordpress.com/2008/02/24/tribes-1-physics-part-two-movement/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		
		<media:content url="https://0.gravatar.com/avatar/f0763395c4ecb8b1ff85e30f6d78323b7a7c85a2bfd2f6d6e218f947c55dc734?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">floodyberry</media:title>
		</media:content>
	</item>
		<item>
		<title>Tribes 1 Physics, Part One: Overview</title>
		<link>https://floodyberry.wordpress.com/2008/02/20/tribes-1-physics-part-one-overview/</link>
					<comments>https://floodyberry.wordpress.com/2008/02/20/tribes-1-physics-part-one-overview/#comments</comments>
		
		<dc:creator><![CDATA[floodyberry]]></dc:creator>
		<pubDate>Thu, 21 Feb 2008 04:04:35 +0000</pubDate>
				<category><![CDATA[Games]]></category>
		<category><![CDATA[Tribes]]></category>
		<category><![CDATA[tribes physics]]></category>
		<guid isPermaLink="false">http://floodyberry.wordpress.com/?p=18</guid>

					<description><![CDATA[Games have been trying replicate the feel of Tribes 1 for almost as long as it&#8217;s been out (Tribes 2 started development in mid-1999) and every single one of them has failed, usually miserably. Tribes 2 physics are an abomination, although in hindsight it should have come as no surprise after the Base+ mod Dynamix [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Games have been trying to replicate the feel of Tribes 1 for almost as long as it&#8217;s been out (Tribes 2 started development in mid-1999) and every single one of them has failed, usually miserably.</p>
<ul>
<li>Tribes 2 physics are an abomination, although in hindsight it should have come as no surprise after the <a href="http://www.gamers.org/pub/idgames2/planetquake/planetstarsiege/tribes/mods/plus.zip">Base+ mod</a> Dynamix play-tested in Tribes 1 to massive disdain. It&#8217;s as if they were trying to build on the success of Tribes 1 when it was 2 weeks old, not 2 years. Unfortunately for the majority of the community not in the beta, we didn&#8217;t find this out until after the game was paid for. Further mods such as Base++, Team Rabbit 2, and Classic attempted to rectify the situation yet were still only a pale ghost of Tribes 1.</li>
<li><a href="http://legendsthegame.net/">Legends</a> has always felt wrong despite how often they tweaked and twiddled the physics and <a href="http://www.tribalwar.com/forums/showthread.php?p=9349485#post9349485">boasted of having the original physics source code</a>. If they did have the source, they either didn&#8217;t know how to implement it or didn&#8217;t have enough of the source to properly replicate all of the required physics.</li>
<li>Tribes Vengeance is so far removed from the feel of Tribes that it shouldn&#8217;t even enter the discussion. Jetting is wrong, air movement shouldn&#8217;t even exist, collisions are wrong, and skiing is a sick joke. I can only hope KineticPoet remembered what used to be and silently winced every time he sat down to work on the game.</li>
<li>The yet-to-be-released <a href="http://pc.ign.com/objects/142/14228914.html">Fallen Empire: Legions</a> appears to be following the &#8220;We&#8217;re not a Tribes 1 clone, so let&#8217;s make wacky changes to ram that home&#8221; mantra. Ideas like 6 way jetting (jetting down while in the air and laterally while on the ground), non-friction sliding a-la T:V, jetpack overdrive, a charge up sniper rifle, etc., sound like they could easily alter the game beyond good taste.</li>
</ul>
<p>What all of these games ostensibly want is to appeal to Tribes 1 players, yet they attempt to accomplish this by using a different and/or completely arbitrary physics system, adding something that resembles a jetpack and skiing and hoping everyone likes it. While I don&#8217;t know if a carbon copy of Tribes 1 on a modern engine would be a success, I do know that almost any Tribes 1 veteran will be unsatisfied with any Tribes style game that does not replicate the feel of Tribes 1 regardless of how often the developers hide behind the claim of &#8220;not Tribes 1&#8221;. </p>
<p>These articles will solve that problem. I will provide everything needed for a 99% re-creation of the Tribes 1 physics on any 3d engine. There are a few pieces to the puzzle so I&#8217;ll be breaking the topics into separate articles for movement, collision, and explosions. This article will go over the basics of the engine and document the structures and constants I&#8217;ll be using.<span id="more-18"></span></p>
<h4>Basics</h4>
<p>To start off, all units will be in <strong>meters</strong>, the default unit of Tribes. You will need to provide a separate function to convert the Tribes meters to your engine&#8217;s native units, e.g. if 1 unit in your engine of choice is roughly equal to an inch, you will need to convert from Tribes meters to inches, i.e. multiply by ~39.37. This will ensure that the authentic Tribes 1 constants are being used and that nothing has been messed up due to being mis-scaled by hand. I will use <strong>MetersToUnits</strong> and <strong>UnitsToMeters</strong> to indicate when a conversion needs to be made.</p>
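<p>As a trivial sketch of the conversion helpers (the 39.37 factor is only an example for an inch-based engine; substitute your engine&#8217;s real units-per-meter ratio):</p>

```cpp
#include <cassert>
#include <cmath>

// Example only: conversion helpers for an engine whose native unit is
// roughly an inch (~39.37 units per meter). Swap in your own ratio.
const float UNITS_PER_METER = 39.37f;

float MetersToUnits( float meters ) { return meters * UNITS_PER_METER; }
float UnitsToMeters( float units )  { return units / UNITS_PER_METER; }
```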
<p>Tribes runs at 31.25 ticks a second, or 0.032s per tick. Altering the ticks per second <strong>will</strong> result in slightly different physics as the gravity is accumulated based on the tick length while jumping is a singular impulse, so as the tick length grows, gravity slowly subsumes the jump impulse. There are other less obvious calculations based on the tick length which will be altered as well; some of these can be accounted for, some not.</p>
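<p>To see the effect, here is a throwaway Euler integration of a straight-up jump (impulse of JUMPIMPULSE / MASS from the light armor constants, gravity of -20) at two tick lengths; this is an illustration of the drift, not Tribes code:</p>

```cpp
#include <cassert>

// Illustration only: gravity is accumulated per tick while the jump is a
// one-time impulse, so the apex height depends on the tick length.
float JumpApex( float tickSeconds ) {
    float v = 75.0f / 9.0f;           // JUMPIMPULSE / MASS, straight up
    float z = 0.0f;
    while ( v > 0.0f ) {
        v += -20.0f * tickSeconds;    // gravity applied every tick
        z += v * tickSeconds;
    }
    return z;
}
```

<p>JumpApex( 0.032f ) comes out around 1.58 while JumpApex( 0.064f ) is around 1.44, roughly a 9% difference from nothing but the tick length.</p>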
<p>The coordinate system is Z up, and Vectors are assumed to have overloaded operators.</p>
<p>All code examples will be in a pseudo C-like language and should be easy to translate into any engine.</p>
<h5>Physics Workflow (details omitted)</h5>
<pre class="brush: cpp; title: ; notranslate">Tick( Move move, float ticklen ) {
    if ( jumping and canJump )
        velocity += Jump( move.direction )

    accel = Gravity.force
    if ( jetting )
        accel += Jet( move.direction )

    velocity += ( accel / ticklen )
    if ( collisionLastTick )
        velocity += Friction( move.speed, ticklen )

    position, velocity = UpdatePosition( ticklen )
}</pre>
<p>There is nothing special in Tribes 1 physics, skiing is not achieved by &#8220;Friction = 0&#8221;, and the only feature an engine needs is the ability to properly detect collisions. There are a lot of nuances required to achieve a simulation that doesn&#8217;t feel subtly wrong, but none are of the earth-shatteringly unique variety that many claim only the Tribes/Torque engine can replicate. They are, however, nuances that you are highly unlikely to match with pure guesswork.</p>
<h5>Structures</h5>
<p>These are the structures and values I will be using for detailed explanations of the physics workflow. Don&#8217;t worry if you don&#8217;t know what a value does just yet.</p>
<pre class="brush: cpp; title: ; notranslate">Gravity {
    Vector3 force = ( 0, 0, -20 )
    Vector3 upNormal = -force.Normalize
    Vector3 downNormal = force.Normalize
}

Physics {
    float TICKBASE = 0.032
    float ELASTICITY = 0.001
    float CRAWLTOSTOP = 0.1
    float MINSPEED = 0.75
    float FRICTIONDECAY = 0.6
    int MAXJUMPTICKS = 256
}

LightArmor : Armor {
    float MASS = 9
    float GROUNDFORCE = 9 * 40
    float GROUNDTRACTION = 3
    float WALKSPEED = 11
    float JUMPIMPULSE = 75
    float JUMPSURFACE_MINDOT = 0.2

    float MINDAMAGESPEED = 25
    float DAMAGESCALE = 0.005

    float JETFORCE = 236
    float JETSIDEFORCE = 0.8
    float JETFORWARD = 22
    float MAXENERGY = 60
    float JETENERGY_DRAIN = 25
    float JETENERGY_CHARGE = 8 + 3

    float BOX_WIDTH = 0.5
    float BOX_DEPTH = 0.5
    float BOX_HEIGHT = 2.3
}

Player {
    Armor armor
    Vector3 position, velocity
    float energy

    bool crawlToStop
    bool collisionLastTick
    Vector3 lastJumpableNormal
    int lastJumpableNormalTimestamp
    float currentFriction
}
</pre>
<h4>Tribes 1 Physics Series</h4>
<ul>
<li>Part One: <a href="https://floodyberry.wordpress.com/2008/02/20/tribes-1-physics-part-one-overview/">Overview</a></li>
<li>Part Two: <a href="https://floodyberry.wordpress.com/2008/02/24/tribes-1-physics-part-two-movement/">Movement</a></li>
<li>Part Three: <a href="https://floodyberry.wordpress.com/2008/04/11/tribes-1-physics-part-three-collision/">Collision</a></li>
<li>Part Four: <a href="https://floodyberry.wordpress.com/2008/04/29/tribes-1-physics-part-four-explosions/">Explosions</a></li>
</ul>
]]></content:encoded>
					
					<wfw:commentRss>https://floodyberry.wordpress.com/2008/02/20/tribes-1-physics-part-one-overview/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
		
		<media:content url="https://0.gravatar.com/avatar/f0763395c4ecb8b1ff85e30f6d78323b7a7c85a2bfd2f6d6e218f947c55dc734?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">floodyberry</media:title>
		</media:content>
	</item>
		<item>
		<title>Writing a (Tribes 1) Master Server</title>
		<link>https://floodyberry.wordpress.com/2008/02/15/writing-a-tribes-1-master-server/</link>
					<comments>https://floodyberry.wordpress.com/2008/02/15/writing-a-tribes-1-master-server/#respond</comments>
		
		<dc:creator><![CDATA[floodyberry]]></dc:creator>
		<pubDate>Fri, 15 Feb 2008 15:59:10 +0000</pubDate>
				<category><![CDATA[Games]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tribes]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[master server]]></category>
		<guid isPermaLink="false">http://floodyberry.wordpress.com/?p=16</guid>

					<description><![CDATA[While I wrote this in September 2007, for various reasons I did not get around to putting the finishing touches on it. Please pretend you&#8217;re reading it then and not now! After the hubbub over Sierra&#8217;s announcement that they were ceasing multiplayer support for Tribes 1 and the resulting scramble to locate a replacement master [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><em>While I wrote this in September 2007, for various reasons I did not get around to putting the finishing touches on it. Please pretend you&#8217;re reading it then and not now!</em></p>
<p>After the hubbub over Sierra&#8217;s <a href="http://www.sierra.com/en/home/news/product_news/071607_-_sierra_heritage.html">announcement</a> that they were ceasing multiplayer support for Tribes 1 and the resulting scramble to locate a replacement master server, I decided to give a shot at writing one. The required feature set appeared simple enough to only take a week or so to implement but with enough gotchas to keep it suitably interesting. While I only had a vague idea of what was required, I got a jump start on proper design by finding <a href="http://www.gamasutra.com/features/20000511/bernier_01.htm">Half-Life and Team Fortress Networking: Closing the Loop on Scalable Network Gaming Backend Services</a> by Yahn W. Bernier, an article detailing the design, implementation, and potential problems of the Half Life master server. Even though some of the topics did not apply to the Tribes 1 requirements, e.g. I can&#8217;t alter the client&#8217;s behavior to auto rate-limit the server list transmission, the article was still quite valuable and an interesting read even if you aren&#8217;t implementing a master server. <span id="more-16"></span></p>
<h4>Getting Started</h4>
<p>My initial server was as basic as you can get: a hash table of servers behind a udp socket which wakes up periodically to time out servers. Tribes servers do not send a &#8220;going offline&#8221; packet, but as their heartbeat period is every 2 minutes the server will not remain in the list for long.</p>
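<p>A minimal sketch of that first version follows; the names and the timeout value are illustrative, not lifted from t1master:</p>

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch only: servers keyed by address, refreshed on each heartbeat and
// pruned by a periodic sweep. Heartbeats arrive every 2 minutes, so a
// timeout of ~2.5 periods (an assumption, not the t1master value) works.
const double HEARTBEAT_TIMEOUT = 300.0;

class ServerList {
public:
    void Heartbeat( const std::string& addr, double now ) { servers[addr] = now; }
    void Prune( double now ) {
        std::map<std::string, double>::iterator it = servers.begin();
        while ( it != servers.end() ) {
            if ( now - it->second > HEARTBEAT_TIMEOUT )
                servers.erase( it++ );   // erase-and-advance, valid for std::map
            else
                ++it;
        }
    }
    size_t Count() const { return servers.size(); }
private:
    std::map<std::string, double> servers;   // addr -> last heartbeat time
};
```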
<p>If the intended environment was a closed setting with a limited number of servers to track, this would be more than sufficient, but in the real world it is vulnerable to many issues and attacks, intentional or otherwise.</p>
<h4>Issues and Attacks</h4>
<p><strong>Bogus Servers</strong>: The ability to add bogus servers is probably the most damaging problem due to the ability to inflate the server list which makes it possible to easily saturate the master server&#8217;s upload by requesting the server list repeatedly. The two ways of adding bogus servers are either by IP spoofing or simply sending thousands of keepalives from your IP with different ports. The Half Life Networking article goes into this problem in depth and offers the solution of a challenge-response system where the master server, on receiving a heartbeat, sends a random value to the server and only allows the heartbeat to be registered if the server sends the correct value back.</p>
<p>While the challenge/response system solves the problem of IP spoofing completely, it is still trivial to listen and respond to the challenges on your real IP. To defeat this attack, it is necessary to limit each IP to a certain number of registered servers. This way, even if you can intercept and respond to 20,000 challenges, only X will be accepted and the server list will be no worse for the wear. To be honest, limiting an IP to a low number of servers, say 3-5, should be in place regardless of the chance for spam. The only reason to run multiple servers on a single IP is either NAT or that the additional servers are backups/rarely used. Server CPU lag is one of the worst experiences to have online, although many mods run so poorly they tend to anticipate this fact and over-compensate with ridiculous weapons and items which leave skill completely out of the equation. That is a topic for another day though.</p>
<p>Of note is the fact that while Tribes network protocol does include a sequence number which can be used as a challenge, it is only 16 bits wide. To prevent brute force attacks (however unlikely), the challenge-response value should have a timeout of a few seconds at most.</p>
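<p>The challenge-response idea can be sketched like this; it is not the actual t1master code, and the names and the 3-second window are assumptions:</p>

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch only: a heartbeat gets a random 16-bit challenge; the server is
// registered only if it echoes the value back within a short window,
// proving it really owns its source IP. The 3-second timeout is an
// assumed value matching "a few seconds at most" from the post.
struct Challenge { unsigned short value; double issuedAt; };

class ChallengeTable {
public:
    static const double TIMEOUT;

    void Issue( const std::string& addr, double now, unsigned short random16 ) {
        Challenge c = { random16, now };
        table[addr] = c;
    }
    bool Verify( const std::string& addr, unsigned short response, double now ) {
        std::map<std::string, Challenge>::iterator it = table.find( addr );
        if ( it == table.end() )
            return false;
        bool ok = ( it->second.value == response ) &&
                  ( now - it->second.issuedAt <= TIMEOUT );
        table.erase( it );   // one shot: a challenge cannot be replayed
        return ok;
    }
private:
    std::map<std::string, Challenge> table;
};
const double ChallengeTable::TIMEOUT = 3.0;
```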
<p><strong>Upload Starvation</strong>: If an IP is spamming list requests to the master server, it might be possible to saturate the master server&#8217;s upload even with a relatively small server list. I wanted to be able to address the problem of spamming requests without requiring human intervention, but at the same time not penalize legitimate users who may click the refresh button a few times too often or have multiple players behind NAT.</p>
<p>The solution, ironically, came from Tribes. I implemented a penalty per IP system similar to the in-game chat spam penalty. Each request accrues a certain amount of penalty, and when the penalty reaches a certain limit the master server stops responding to that IP. The penalty is decreased by 1 for each second that passes, allowing the client to eventually access the master again. I also added a maximum penalty cap (currently at 10 seconds) so that bans will be over fairly quickly once the IP ceases to spam.</p>
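<p>A minimal sketch of that penalty scheme, assuming 2 points per request and a limit of 6 (both made-up values; only the 10-second cap comes from the post):</p>

```cpp
#include <cassert>
#include <algorithm>

// Sketch only: each request accrues penalty, penalty drains 1 point per
// second, and the master ignores the IP while it is over the limit. The
// cap keeps bans short once the spamming stops. PENALTY_PER_REQUEST and
// PENALTY_LIMIT are illustrative; the 10-second cap is from the post.
struct Penalty {
    double points;
    double lastSeen;
    Penalty() : points( 0 ), lastSeen( 0 ) {}
};

const double PENALTY_PER_REQUEST = 2.0;
const double PENALTY_LIMIT       = 6.0;
const double PENALTY_CAP         = 10.0;

// Returns true if the request should be answered.
bool AllowRequest( Penalty& p, double now ) {
    p.points = std::max( 0.0, p.points - ( now - p.lastSeen ) );  // decay 1/sec
    p.lastSeen = now;
    if ( p.points >= PENALTY_LIMIT )
        return false;
    p.points = std::min( p.points + PENALTY_PER_REQUEST, PENALTY_CAP );
    return true;
}
```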
<p><strong>Unintentional flooding</strong>: Because Tribes has no auto-rate limiting mechanism to control the server list download speed, it&#8217;s possible the master server could either upload the server list faster than the client can handle or saturate its own upload by firing off too many packets at once when multiple requests come in. The obvious solution is for the master server to throttle its upload, i.e. queuing up packets instead of sending them off instantly. While a simple rate limited FIFO queue would keep the server&#8217;s upload from becoming saturated, it would also mean that after a certain point the packets on the tail end of a large queue would be timed out by the client before they had a chance to arrive. This means rate limiting needs to be done <em>per-connection</em> and not globally.</p>
<p>Tribes, however, throws a wrench into how low you can rate limit a connection. When the first packet of a master server response arrives, Tribes creates X pending &#8220;ping&#8221; responses to wait for, where X is the number of packets left in the master server response. Because both the master server list request and server ping packets are the same (\x10\x03), Tribes piggybacks the master server list request in the server pending ping list and simply re-pings any packet in the list which times out. Unfortunately, the timestamps for a particular set of master server response packets are <em>never updated</em> when succeeding packets come in, meaning that if Tribes does not receive <em>every</em> master server packet within $pref::pingTimeoutTime milliseconds (default 900ms), it sends a new request for each remaining packet in the list.</p>
<p>So let&#8217;s say the server list contains 3000 servers, and the master server batches 64 servers per packet. This would result in 47 packets at around ~450 bytes each, or 21k of data. If the master server uploads at 4k/s to the client, only 4k*0.9s = 3.6k of data will make it to the client before the remaining 38 packets are timed out. Tribes would then see 38 &#8220;pings&#8221; that timed out and re-ping each one, updating their packet keys and timestamps so that the packets still coming in from the previous request would be discarded. If Tribes hits $pref::pingRetryCount on any single master server packet, it will decide it can&#8217;t contact a master server and either move on to the next master or pop up an error box. This unfortunately means the per-client rate limit has to be set high enough to make sure the server list can be transmitted within a couple seconds at most, or 900ms to be safe. Assuming a maximum of 500 servers, or ~4k of data, 5k/s per client should be a decent limit.</p>
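<p>The arithmetic above generalizes to a tiny helper (packet size and batch count taken from the paragraph; the helper itself is just for illustration):</p>

```cpp
#include <cassert>
#include <cmath>

// Illustration of the rate-limit math above: with ~450-byte packets of 64
// servers each, how fast must the per-client rate be for the whole list
// to land inside the 900ms ping timeout?
const double BYTES_PER_PACKET   = 450.0;
const int    SERVERS_PER_PACKET = 64;
const double TIMEOUT_SECONDS    = 0.9;

double MinBytesPerSecond( int serverCount ) {
    int packets = ( serverCount + SERVERS_PER_PACKET - 1 ) / SERVERS_PER_PACKET;
    return packets * BYTES_PER_PACKET / TIMEOUT_SECONDS;
}
```

<p>3000 servers means 47 packets (~21k of data) and would need roughly 23.5k/s per client; 500 servers needs about 4k/s, which is why 5k/s is a decent limit.</p>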
<p>Due to a poor choice of data structures, Tribes itself limits how large the list can grow. The server list and pending ping list are stored in vectors which grow by 5 items at a time(!), leading to tons of potential copying on vector resizes and painful O(N) lookups. When the &#8220;to ping&#8221; list reaches ~6000 servers, Tribes locks up for longer than I care to wait. This is exacerbated when you raise $pref::pingTimeoutTime to accommodate larger server lists, leading to a realistic ceiling of around 4000 servers regardless of the timeout bug.</p>
<h4>Unresolved Attacks</h4>
<p>Because Tribes has no challenge-response mechanism available for server list requests, there is nothing to be done about Upload Starvation by IP spoofing. Please try not to annoy anyone who has the ability to spoof and the desire to spam your master server!</p>
<h4>Mirrors</h4>
<p>I thought about doing something with mirrors, but there really is no good reason to bother.</p>
<p>1. CPU load is a non-issue. Even if there were 50,000 live servers sending heartbeats every 2 minutes, that&#8217;s only 415 heartbeats a second. A heartbeat requires 1 hashtable lookup for the IP penalty, 1 hashtable lookup for the server table, and 3 hashtable lookups for the pending servers table if the server doesn&#8217;t exist. A worst case of 5 hash table lookups * 415 = 2075 hashtable lookups a second which would not even register on a load monitor.</p>
<p>Using a separate process to spam my master with heartbeats, I can get up to ~50,000 heartbeats a second without sending a challenge packet and ~35,000 heartbeats sending the challenge (and using ~1000kb/s up in the process). Using a separate machine to spam heartbeats peaks at around ~14,000 a second with around ~30% CPU load for the master server.</p>
<p>A &#8220;realistic&#8221; test of 18,000 servers, 22 server list requests a second, and 450 heartbeats a second resulted in ~3-10% CPU load and 2,800 kb/s upload. Memory usage peaked at around 40mb, largely due to the 530 concurrent 120k server lists being served from memory at 5kb/s. Reducing the server count to 1,000 while keeping the requests and heartbeats the same results in no measurable CPU load and 900k peak memory usage.</p>
<p>2. Bandwidth, while an issue with a community supporting 50,000 servers, is not an issue with the Tribes community.</p>
<ul>
<li>There are currently 121 active servers in the current master list.</li>
<li>According to archive.org&#8217;s snapshot of gamespy&#8217;s <a href="http://web.archive.org/web/20040612064055/http://archive.gamespy.com/stats/">stats page</a>, Tribes only had ~250 servers up in July 2004.</li>
<li>Going back even farther is one of Tim Sweeney&#8217;s <a href="http://www.team5150.com/~andrew/sweeney/tims_news_1999.html#d19991002">old news posts</a> from October 1999, showing 589 servers during Tribes&#8217; heyday.</li>
</ul>
<p>Let&#8217;s do the math:</p>
<pre class="brush: cpp; title: ; notranslate">        Servers    Bytes / List     Req/s 40kb up   160kb up
1999        589            4323              9.47      37.90
2004        250            1950             21.01      84.02
2007        121            1047             39.12     156.49</pre>
<p>Even a crappy cable modem can handle 39 reqs/s with the current list, and a T1 could do just as well with the server count from 1999. Obviously it makes more sense to find a host that will stay up than to make a tangled net of mirrors. There are also other problems with unorganized mirrors such as all of the mirrors a server reports to going down and the fact that the Tribes client queries mirrors sequentially instead of randomly.</p>
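<p>The table values fall out of one division, assuming &#8220;40kb&#8221; and &#8220;160kb&#8221; mean 40*1024 and 160*1024 bytes per second:</p>

```cpp
#include <cassert>
#include <cmath>

// Reproduces the table above: sustained list requests per second a given
// upload can serve is just upload bytes/s over bytes per list.
double RequestsPerSecond( double uploadBytesPerSec, double bytesPerList ) {
    return uploadBytesPerSec / bytesPerList;
}
```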
<p>If a host can&#8217;t be found that is guaranteed to be up, then finding 5 such hosts and needlessly complicating the master server isn&#8217;t going to make the system more robust.</p>
<h4>Code!</h4>
<p>The only thing I don&#8217;t really like about the current code is the packet queuing. While the current server can obviously scale quite high, it should be possible to construct packets on the fly instead of batching them up in memory. Two problems need solving: how to keep multiple iterators in a hash table which can have elements added/removed at any time, and, since the master has to tell Tribes up front how many packets it will be sending, how to avoid sending servers which are added after the time of the list request. I have some ideas, but I should probably post this first instead of continuing to put it off.</p>
<p>The code is cross-platform, although I don&#8217;t have anything other than Linux to test on so you will probably need to jump through some hoops to compile on BSD or OSX.</p>
<p>All of the tunable parameters are at the top of t1master.cpp and should be fairly self explanatory. I didn&#8217;t want any external dependencies so the MOTD is &#8220;hardcoded&#8221;, although you can pass it on the command line. I also didn&#8217;t see a point in backing up the server list at any point since the list will fully regenerate itself in 2 minutes. Note that I do randomize the hash seed so even if a spoofer floods the server with bogus IPs, he won&#8217;t be able to logjam the hash tables with collisions!</p>
<p>t1spam.cpp is the program I used for stress testing by hammering the server with heartbeats and listreqs. You should see the lines to comment/uncomment in t1master.cpp(masterserver::process) and optionally in t1master.cpp(main) to simulate random request sources to get realistic loads for the hash tables and penalties.</p>
<ul>
<li>GitHub Repo: <a href="https://github.com/floodyberry/t1master">t1master</a></li>
</ul>
]]></content:encoded>
					
					<wfw:commentRss>https://floodyberry.wordpress.com/2008/02/15/writing-a-tribes-1-master-server/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		
		<media:content url="https://0.gravatar.com/avatar/f0763395c4ecb8b1ff85e30f6d78323b7a7c85a2bfd2f6d6e218f947c55dc734?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">floodyberry</media:title>
		</media:content>
	</item>
		<item>
		<title>C++ Templates and Class Inheritance</title>
		<link>https://floodyberry.wordpress.com/2007/05/17/c-templates-and-class-inheritance/</link>
					<comments>https://floodyberry.wordpress.com/2007/05/17/c-templates-and-class-inheritance/#respond</comments>
		
		<dc:creator><![CDATA[floodyberry]]></dc:creator>
		<pubDate>Thu, 17 May 2007 06:03:31 +0000</pubDate>
				<category><![CDATA[cplusplus]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[templates]]></category>
		<guid isPermaLink="false">http://floodyberry.wordpress.com/?p=14</guid>

					<description><![CDATA[The following code is not legal C++:

[sourcecode language='cpp']
template < class type >
struct A {
	void f() {}
	type mX;
};

template < class type >
struct B : public A< type > {
	void g() { mY = ( mX ); f(); }
	type mY;
};
[/sourcecode]

The best part is that unless you know the obscure reason why it is not legal, it appears legal and might even compile and run perfectly depending on which compiler you're using. Not surprisingly, that is exactly how I ran into it. I was doing templated class inheritance and thought I was in the clear because everything ran fine with MSVC7.1 and ICC 9, but when I belatedly tried to compile with g++ 3.4.4, I ran into the following errors..]]></description>
										<content:encoded><![CDATA[<p>The following code is not legal C++:</p>
<pre class="brush: cpp; title: ; notranslate">
template &lt; class type &gt;
struct A {
	void f() {}
	type mX;
};

template &lt; class type &gt;
struct B : public A&lt;type&gt; {
	void g() { mY = ( mX ); f(); }
	type mY;
};
</pre>
<p>The best part is that unless you know the obscure reason why it is not legal, it appears legal and might even compile and run perfectly depending on which compiler you&#8217;re using. Not surprisingly, that is exactly how I ran into it. I was doing templated class inheritance and thought I was in the clear because everything ran fine with MSVC7.1 and ICC 9, but when I belatedly tried to compile with g++ 3.4.4, I ran into the following errors: <span id="more-14"></span></p>
<blockquote>
<div>tmpl.cpp: In member function `void B&lt;type&gt;::g()&#8217;:<br />
tmpl.cpp:9: error: `mX&#8217; undeclared (first use this function)<br />
tmpl.cpp:9: error: (Each undeclared identifier is reported only once for each function it appears in.)<br />
tmpl.cpp:9: error: there are no arguments to `f&#8217; that depend on a template parameter, so a declaration of `f&#8217; must be available<br />
tmpl.cpp:9: error: (if you use `-fpermissive&#8217;, G++ will accept your code, but allowing the use of an undeclared name is deprecated)</div>
</blockquote>
<p>What the who? Why can&#8217;t it find mX or f? How are you supposed to inherit templated classes in g++? Why does it work on some compilers and not others? Is this a bug? The answer is the C++ standard, and in this case gcc 3.4&#8217;s stricter conformance to it. It didn&#8217;t take me long to find a response to this exact problem on the gcc mailing list from 2004:</p>
<p><a href="http://gcc.gnu.org/ml/gcc-help/2004-04/msg00382.html">re: gcc 3.4 template problem</a></p>
<blockquote>
<div>This happens because gcc 3.4 now implements two-phase name lookup, see <a href="http://gcc.gnu.org/gcc-3.4/changes.html">http://gcc.gnu.org/gcc-3.4/changes.html</a> and page down until you see the 2nd example in the C++ section, which is much like yours.</p>
<p>Also see DR 213 <a href="http://anubis.dkuug.dk/jtc1/sc22/wg21/docs/cwg_defects.html#213">http://anubis.dkuug.dk/jtc1/sc22/wg21/docs/cwg_defects.html#213</a> whose resolution adds:</p>
<blockquote>
<div>In the definition of a class template or a member of a class template, <strong>if a base class of the class template depends on a template-parameter, the base class scope is not examined during unqualified name lookup</strong> either at the point of definition of the class template or member or during an instantiation of the class template or member.</div>
</blockquote>
<p>to the standard.</p>
<p><strong>You can also use &#8216;this-&gt;x&#8217; to make x dependent, making the implementation to look into the base class.</strong></div>
</blockquote>
<p>The C++ FAQ Lite has some further information on the subject with the understandable disclaimer &#8220;This might hurt your head; better if you sit down&#8221;: <a href="http://www.parashift.com/c++-faq-lite/templates.html#faq-35.19">[35.19] Why am I getting errors when my template-derived-class uses a member it inherits from its template-base-class?</a>.</p>
<p>While I can understand what the standard is saying and the consequences of it, I can&#8217;t figure out why some compilers still allow the now-illegal behavior (leading to portability nightmares if you don&#8217;t know what the heck just broke), or why the intuitive interpretation was made illegal in the first place. Unfortunately, supporting the intuitive version when the standard says the opposite only opens up more opportunities to create non-portable code. While I dislike &#8220;this-&gt;&#8221; litter in my classes, it looks like the cleanest portable solution that doesn&#8217;t open up other issues (such as explicit A&lt;type&gt;:: prefixing, which would break virtual functions).</p>
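<p>For completeness, here is the fixed version of the opening example, with this-&gt; making the inherited names dependent (a sketch; any of the portable spellings from the FAQ would do):</p>

```cpp
#include <cassert>

template < class type >
struct A {
	void f() {}
	type mX;
};

template < class type >
struct B : public A< type > {
	// this-> makes mX and f dependent names, deferring their lookup to
	// instantiation time, when the base class scope is actually known
	void g() { mY = this->mX; this->f(); }
	type mY;
};
```

<p>This compiles cleanly under two-phase lookup and under the lenient compilers alike.</p>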
<p>If you haven&#8217;t had enough, try making sense of this thread in comp.lang.c++.moderated: <a href="http://groups.google.com/group/comp.lang.c++.moderated/browse_frm/thread/eb1a641f1807aad0/533e7e39fd887ab4?tvc=1#533e7e39fd887ab4"> Dependent names in templates, or are they?</a> What a horrid little rule.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://floodyberry.wordpress.com/2007/05/17/c-templates-and-class-inheritance/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		
		<media:content url="https://0.gravatar.com/avatar/f0763395c4ecb8b1ff85e30f6d78323b7a7c85a2bfd2f6d6e218f947c55dc734?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">floodyberry</media:title>
		</media:content>
	</item>
		<item>
		<title>UTF-8 Conversion Tricks</title>
		<link>https://floodyberry.wordpress.com/2007/04/14/utf-8-conversion-tricks/</link>
					<comments>https://floodyberry.wordpress.com/2007/04/14/utf-8-conversion-tricks/#respond</comments>
		
		<dc:creator><![CDATA[floodyberry]]></dc:creator>
		<pubDate>Sat, 14 Apr 2007 08:04:03 +0000</pubDate>
				<category><![CDATA[cplusplus]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[optimization]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[utf-8]]></category>
		<guid isPermaLink="false">http://floodyberry.wordpress.com/?p=11</guid>

					<description><![CDATA[UTF-8 is a wonderfully simple encoding format with some very nice properties, but the juggling required to convert to UTF-16, and UTF-32 can be a little tricky and fairly easy to do poorly. This is further compounded by the various error conditions you must keep an eye out for, such as overlong encodings, reserved ranges, [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8</a> is a wonderfully simple encoding format with some very nice properties, but the juggling required to convert to UTF-16, and UTF-32 can be a little tricky and fairly easy to do poorly. This is further compounded by the various error conditions you must keep an eye out for, such as overlong encodings, reserved ranges, surrogate markers, incomplete sequences, and so on.</p>
<p>These are a couple tricks you can employ to hopefully keep the conversion fast <em>and</em> robust.</p>
<p><span id="more-11"></span></p>
<h4>Tail Length Lookup</h4>
<p>Our first trick is to use a lookup table for the initial byte. This allows you to both a) tell whether the byte is valid (80 to bf and fe to ff are invalid leading bytes, as well as f5 to fd if you don&#8217;t want to handle 5 and 6 byte sequences) and b) determine the number of trailing bytes in the expected sequence. We will also need the length of the sequence to quickly ensure there are enough bytes left in the input, as well as for other upcoming tricks, so this actually results in multiple wins.</p>
<p>If you want to cut down on the table size, you could use 128 values and take (c&lt;&lt;1), or 64 values and take ((c-0x80)&lt;&lt;1), although you&#8217;ll need an extra check for 80-bf with 64 values.</p>
<pre class="brush: cpp; title: ; notranslate">
const UTF32 Replacement = ( 0xfffd );

const unsigned char UTF8TailLengths[256] = {
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
	2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
	3,3,3,3,3,3,3,3,4,4,4,4,5,5,0,0
};
</pre>
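<p>As a sanity check, the table can be regenerated from the lead byte&#8217;s bit pattern (the number of leading one bits, minus one); a quick sketch:</p>

```cpp
#include <cassert>

// Rebuild the tail-length table from first principles: a lead byte with
// n leading one bits (n >= 2) starts a sequence with n-1 tail bytes;
// 0xxxxxxx is ASCII, while 10xxxxxx and 1111111x are invalid lead bytes.
unsigned char TailLength(unsigned int c) {
	if (c < 0x80)
		return 0;                       // ASCII
	int ones = 0;
	for (unsigned int bit = 0x80; bit && (c & bit); bit >>= 1)
		++ones;
	if (ones < 2 || ones > 6)
		return 0;                       // 80-bf and fe-ff
	return (unsigned char)(ones - 1);
}
```
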
<pre class="brush: cpp; title: ; notranslate">
UTF32 utf8_to_utf32( UTF8 *&amp;s, const UTF8 *end ) {
	UTF32 c = ( *s++ );
	if ( c &lt; 0x80 )
		return ( c );
	unsigned int tail = ( UTF8TailLengths[ c ] );
	if ( ( !tail ) || ( s + tail &gt; end ) )
		return ( Replacement );
</pre>
<h4>Overlong Encodings and Magic Subtraction</h4>
<p>Once we know the length of the expected sequence, we can attempt to decode it. The basic decoding loop is something like:</p>
<pre class="brush: cpp; title: ; notranslate">
c &amp;= ( 0x3f &gt;&gt; tail );

unsigned int i;
for (  i = 0; i &lt; tail; ++i ) {
	if ( ( s[i] &amp; 0xc0 ) != 0x80 )
		break;

	c = ( ( c &lt;&lt; 6 ) + ( s[i] &amp; 0x3f ) );
}

s += i;
if ( i != tail )
	return ( Replacement );
</pre>
<p>At the end of decoding, we will still be faced with a problem: how do you tell if it was an overlong encoding? To keep the mapping of UTF-8 to UTF-32 one-to-one, we are required to reject any encoding that uses more bytes than it requires. Markus Kuhn&#8217;s <a href="http://www.cl.cam.ac.uk/~mgk25/ucs/utf8_check.c">utf8_check.c</a> has a jungle of conditionals to detect the specific lead and tail byte encodings that would indicate an overlong encoding, but this is not something we want to do in our inner loop!</p>
<p>This is where our Overlong Encoding and Magic Subtraction lookup comes in. Since we know the length of the tail, we can create a lookup of the <em>minimum</em> value a sequence with <em>tail</em> bytes needs to be.</p>
<p>Magic Subtraction is a side bonus of knowing the length of the tail. With Magic Subtraction, we can skip masking off the lead byte as well as eliminate the &amp;0x3f mask in the inner loop! Magic Subtraction works by accumulating the value of the masked-off bits into a single value and subtracting that value at the end. Because we&#8217;re making sure each byte is well formed, we can be sure that the masked-off bits will add up to a constant value. I got this trick from <a href="http://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c">ConvertUTF.c</a> by Mark E. Davis.</p>
<p>If you want to double check the magic subtraction values, you can calculate them yourself: find the constant value for the lead byte of each sequence, then for each byte in the sequence, shift the value over 6 bits and add 80.</p>
<ul>
<li>1 tail byte: (c0&lt;&lt;6)+80</li>
<li>2 tail bytes: (((e0&lt;&lt;6)+80)&lt;&lt;6)+80</li>
<li>3 tail bytes: (((((f0&lt;&lt;6)+80)&lt;&lt;6)+80)&lt;&lt;6)+80</li>
<li>etc.</li>
</ul>
<pre class="brush: cpp; title: ; notranslate">
struct UTF8Lookup {
	UTF32 mOverlongMinimum, mMagicSubtraction;
} const UTF8Lookups[ 6 ] = {
	{ 0x00000000, 0x00000000 },
	{ 0x00000080, 0x00003080 },
	{ 0x00000800, 0x000E2080 },
	{ 0x00010000, 0x03C82080 },
	{ 0x00200000, 0xFA082080 },
	{ 0x04000000, 0x82082080 },
};
</pre>
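<p>If you&#8217;d rather not grind through that arithmetic by hand, the constants can be generated (and checked) mechanically; a quick sketch (32-bit wraparound is intended for the 5 tail byte entry):</p>

```cpp
#include <cassert>
#include <cstdint>

// Generate the magic subtraction constant for a sequence with 'tail' tail
// bytes: start from the lead byte's high-bit prefix, then for each tail
// byte shift left 6 and add 0x80. The 5-tail-byte value wraps around
// 32 bits, which is exactly what the decoder relies on.
uint32_t MagicSubtraction(unsigned int tail) {
	static const uint32_t lead[6] = { 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
	uint32_t v = (tail == 0) ? 0 : lead[tail];
	for (unsigned int i = 0; i < tail; ++i)
		v = (v << 6) + 0x80;
	return v;
}
```
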
<pre class="brush: cpp; title: ; notranslate">
unsigned int i;
for ( i = 0; i &lt; tail; ++i ) {
	if ( ( s[i] &amp; 0xc0 ) != 0x80 )
		break;

	c = ( c &lt;&lt; 6 ) + s[i];
}

s += i;
if ( i != tail )
	return ( Replacement );

const UTF8Lookup &amp;lookup = UTF8Lookups[ tail ];
c -= ( lookup.mMagicSubtraction );
if ( c &lt; lookup.mOverlongMinimum )
	return ( Replacement );
</pre>
<h4>Tail Byte Error Bits</h4>
<p>You may have noticed that we are checking every single tail byte to see if it is well formed ( ( *s &amp; 0xc0 ) != 0x80 ). Even if we used a switch on <em>tail</em> to unroll our loop, we would still need all of the conditionals. The Tail Byte Error Bits trick is what I came up with to remove the checks.</p>
<p>If we have a lookup table with 1 for invalid tail bytes and 0 for valid bytes, we can accumulate these values in a mask variable which will be non-zero at the end of our decoding loop if any of the tail bytes were invalid. Furthermore, if we accumulate them so that ( mask = ( mask &lt;&lt; 1 ) | UTF8InvalidTailBytes[ s[i] ] ), we can also tell which was the first invalid tail byte by looking at which bits in <em>mask</em> are set. This allows us to back the source pointer up to the last valid byte.</p>
<p><em>Feb. 27, 2008</em>: Re-looking at this, I realized you can shrink the table to just 4 values since only the top 2 bits are being checked. This leaves us with ( mask = ( mask &lt;&lt; 1 ) | UTF8InvalidTailBits[ s[i]&gt;&gt;6 ] ).</p>
<pre class="brush: cpp; title: ; notranslate">
const unsigned char UTF8InvalidTailBits[4] = {
	1,1,0,1,
};

const unsigned int UTF8InvalidOffset[32] = {
	0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4,
	5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
};
</pre>
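<p>Before wiring these tables into the loop, it&#8217;s worth checking that the mask/offset pair really does back the pointer up to the first bad byte; a small sketch:</p>

```cpp
#include <cassert>

const unsigned char UTF8InvalidTailBits[4] = { 1, 1, 0, 1 };
const unsigned int UTF8InvalidOffset[32] = {
	0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4,
	5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
};

// Returns how many bytes of the tail end up consumed: 'tail' if every
// byte was a valid 10xxxxxx continuation, otherwise the index of the
// first invalid byte (so decoding can resynchronize there). The highest
// set bit of the mask encodes where the first failure happened.
unsigned int ConsumeTail(const unsigned char *s, unsigned int tail) {
	unsigned int mask = 0;
	for (unsigned int i = 0; i < tail; ++i)
		mask = (mask << 1) | UTF8InvalidTailBits[s[i] >> 6];
	unsigned int consumed = tail;
	if (mask)
		consumed -= UTF8InvalidOffset[mask];
	return consumed;
}
```
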
<pre class="brush: cpp; title: ; notranslate">
unsigned int mask = ( 0 );
for ( unsigned int i = 0; i &lt; tail; ++i ) {
	c = ( c &lt;&lt; 6 ) + s[i];
	mask = ( mask &lt;&lt; 1 ) | UTF8InvalidTailBits[ s[i] &gt;&gt; 6 ];
}

s += tail;
if ( mask ) {
	s -= UTF8InvalidOffset[ mask ];
	return ( Replacement );
}

const UTF8Lookup &amp;lookup = UTF8Lookups[ tail ];
c -= ( lookup.mMagicSubtraction );
if ( c &lt; lookup.mOverlongMinimum )
	return ( Replacement );

</pre>
<h4>Finishing Up</h4>
<p>If we made it this far, we can be sure that our UTF-8 sequence is valid. Well, <em>almost</em> valid. There are still certain UTF-32 values that, even when properly encoded in UTF-8, are illegal, and must be checked for if you want a robust converter.</p>
<p>The offending values:</p>
<blockquote>
<ul>
<li><strong>d800-dfff</strong>: UTF-16 uses d800-dfff to encode its surrogate pairs, i.e. values that don&#8217;t fit into 16 bits. This means UTF-8/UTF-32 are not allowed to encode these values.</li>
<li><strong>fdd0-fdef</strong>: This range was added to make your life more difficult.</li>
<li><strong>xfffe-xffff</strong>: x ranges from 0 to 10 (hex), so the values to check for are fffe-ffff, 1fffe-1ffff, etc.</li>
<li><strong>&gt; 10ffff</strong>: 10ffff is the highest Unicode codepoint.</li>
</ul>
</blockquote>
<p>The first two ranges can be checked using a subtraction and a compare instead of two compares, a trick I got from uClibc&#8217;s <a href="http://www.uclibc.org/cgi-bin/viewcvs.cgi/trunk/uClibc/libc/misc/wchar/wchar.c">wchar.c</a>. The third range can be checked with an and and a compare. The last range (&gt; 10ffff) is a simple compare.</p>
<pre class="brush: cpp; title: ; notranslate">
bool InRange( UTF32 c, UTF32 lo, UTF32 hi ) { return ( (UTF32 )( c - lo ) &lt; ( hi - lo + 1 ) ); }

bool IsSurrogate( UTF32 c ) { return ( InRange( c, 0xd800, 0xdfff ) ); }
bool IsNoncharacter( UTF32 c ) { return ( InRange( c, 0xfdd0, 0xfdef ) ); }
bool IsReserved( UTF32 c ) { return ( ( c &amp; 0xfffe ) == 0xfffe ); }
bool IsOutOfRange( UTF32 c ) { return ( c &gt; 0x10ffff ); }
</pre>
<p>This can be further simplified if you want to make use of a rather excessive lookup table. If you initialize a 32k-entry lookup table like so, you can check any value under 0x10000 with a single lookup. Each pair of codepoints shares an entry, which works because every one of the invalid ranges begins and ends on an even/odd pair.</p>
<pre class="brush: cpp; title: ; notranslate">
bool BMPInvalid[ 0x8000 ];

for ( UTF32 c = 0; c &lt; 0x10000; ++c )
	BMPInvalid[ c &gt;&gt; 1 ] = ( IsSurrogate( c ) || IsNoncharacter( c ) || IsReserved( c ) );

bool IsInvalidBMP( UTF32 c ) { return ( BMPInvalid[ c &gt;&gt; 1 ] ); }
</pre>
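<p>If we paste the tricks together, the whole converter comes out to something like this (a sketch: the tail length table is generated rather than spelled out, and 5 and 6 byte lead bytes are kept in the table so those sequences consume and decode as a single Replacement, as discussed below):</p>

```cpp
#include <cassert>
#include <string>

typedef unsigned int UTF32;
typedef unsigned char UTF8;

const UTF32 Replacement = 0xfffd;

// Tail length per lead byte: a lead byte with n leading one bits (n >= 2)
// starts a sequence with n-1 tail bytes; everything else maps to 0.
static unsigned char UTF8TailLengths[256];
static bool InitTables() {
	for (int c = 0; c < 256; ++c) {
		int ones = 0;
		for (int bit = 0x80; bit && (c & bit); bit >>= 1)
			++ones;
		UTF8TailLengths[c] = (ones >= 2 && ones <= 6) ? (unsigned char)(ones - 1) : 0;
	}
	return true;
}
static const bool tablesReady = InitTables();

static const unsigned char UTF8InvalidTailBits[4] = { 1, 1, 0, 1 };
static const unsigned int UTF8InvalidOffset[32] = {
	0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4,
	5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
};

struct UTF8Lookup { UTF32 mOverlongMinimum, mMagicSubtraction; };
static const UTF8Lookup UTF8Lookups[6] = {
	{ 0x00000000, 0x00000000 },
	{ 0x00000080, 0x00003080 },
	{ 0x00000800, 0x000E2080 },
	{ 0x00010000, 0x03C82080 },
	{ 0x00200000, 0xFA082080 },
	{ 0x04000000, 0x82082080 },
};

static bool InRange(UTF32 c, UTF32 lo, UTF32 hi) { return (UTF32)(c - lo) < (hi - lo + 1); }
static bool IsSurrogate(UTF32 c)    { return InRange(c, 0xd800, 0xdfff); }
static bool IsNoncharacter(UTF32 c) { return InRange(c, 0xfdd0, 0xfdef); }
static bool IsReserved(UTF32 c)     { return (c & 0xfffe) == 0xfffe; }
static bool IsOutOfRange(UTF32 c)   { return c > 0x10ffff; }

UTF32 utf8_to_utf32(const UTF8 *&s, const UTF8 *end) {
	UTF32 c = *s++;
	if (c < 0x80)
		return c;
	unsigned int tail = UTF8TailLengths[c];
	if (!tail || (s + tail > end))
		return Replacement;
	unsigned int mask = 0;
	for (unsigned int i = 0; i < tail; ++i) {
		c = (c << 6) + s[i];
		mask = (mask << 1) | UTF8InvalidTailBits[s[i] >> 6];
	}
	s += tail;
	if (mask) {
		s -= UTF8InvalidOffset[mask];   // resync at the first invalid tail byte
		return Replacement;
	}
	c -= UTF8Lookups[tail].mMagicSubtraction;
	if (c < UTF8Lookups[tail].mOverlongMinimum)
		return Replacement;             // overlong encoding
	if (IsSurrogate(c) || IsNoncharacter(c) || IsReserved(c) || IsOutOfRange(c))
		return Replacement;             // well formed, but an illegal value
	return c;
}

// Helper to exercise the converter one sequence at a time.
UTF32 DecodeOne(const std::string &bytes) {
	const UTF8 *s = (const UTF8 *)bytes.data();
	return utf8_to_utf32(s, s + bytes.size());
}
```
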
<h4>Cheating</h4>
<p>Ok, all we have to do is paste this all together and we&#8217;ll have a lightning fast UTF-8 to UTF-32 converter, right? Well.. consider the following example: You have a 1000k UTF-8 file which is 90% ASCII that you want converted to UTF-32. Would you rather convert it with a function that takes 30 cycles for an ASCII character and 50 cycles for a multibyte character, or a function that takes 10 cycles for an ASCII character and 90 cycles for a multibyte character?</p>
<p>If you have an input agnostic converter and/or unlucky compiler optimizations, your conversion function can end up looking something like:</p>
<pre class="brush: cpp; title: ; notranslate">
mov esi, [ start ]
mov edi, [ dest ]
cmp esi, [ end ]
jge finished
loop:
	movzx eax, byte ptr [ esi ]
	inc esi
	cmp eax, 0x80
	jl ascii

	...130 bytes of code to handle multi-byte encodings...

ascii:
	mov [ edi ], eax
	add edi, 4
	cmp esi, [ end ]
	jl loop
finished:
</pre>
<p>Yuck! Instead of a nice tight loop for the easy ASCII case, this will most likely crap all over the cache and slow you down. If you&#8217;re especially unlucky when trusting your compiler to handle templates and functions that should <em>obviously</em> be inlined, you&#8217;ll even end up with a call or two <strong>per character</strong>. Doing a little optimization and tweaking can result in code like:</p>
<pre class="brush: cpp; title: ; notranslate">
while ( start &lt; end ) {
	while ( ( start &lt; end ) &amp;&amp; ( *start &lt; 0x80 ) ) {
		*to++ = *start++;
	}
	if ( start &lt; end ) {
	...
	}
}
</pre>
<p>which compiles down to something like:</p>
<pre class="brush: cpp; title: ; notranslate">
mov esi, [ start ]
mov edi, [ dest ]
cmp esi, [ end ]
jge finished
loop:
	movzx eax, byte ptr [ esi ]
	inc esi
	cmp eax, 0x80
	jge notascii
	mov [ edi ], eax
	add edi, 4
	cmp esi, [ end ]
	jl loop
	jmp finished

notascii:

	...130 bytes of code to handle multi-byte encodings...

	mov [ edi ], eax
	add edi, 4
	cmp esi, [ end ]
	jl loop
finished:
</pre>
<p>This function should cut through ASCII-oriented UTF-8 like butter, even if the multi-byte handling is a little slower than a more optimized converter. This code re-working may have little to no gain if your input is highly varied, but if you have a good idea of what you&#8217;ll be facing, it may be worth it to tweak your functions to the data.</p>

<h4>5 and 6 byte sequences</h4>
<p>The original UTF-8 specification allowed for 5 and 6 byte sequences (up to 31 bits of data), however, only up to 4 byte sequences are valid under <a href="http://www.faqs.org/rfcs/rfc3629.html">RFC 3629</a>. So what do you do with 5 and 6 byte sequences? You can interpret the entire sequence and dump it as a single invalid character, or dump an invalid character for every byte in the sequence. Since the lead byte for 5 and 6 byte sequences (f5-fd) will never appear in <strong>any</strong> 4 byte or shorter sequence, interpreting the sequences as a single (invalid) character appears to make the most sense:</p>
<ul>
<li>If you are not actually processing UTF-8 text, or your input is corrupted, it won&#8217;t matter how you interpret them as any interpretation will produce garbage</li>
<li>If you <em>are</em> processing valid UTF-8 text, they can only appear due to an intentional 5 to 6 byte sequence. While illegal, it still represents a single character, not 5 to 6 invalid characters.</li>
</ul>
<h4>Fin</h4>
<p>dreamprojections&#8217; wonderful <a href="http://code.google.com/p/syntaxhighlighter/">Syntax Highlighter</a> was a contributing factor to the length of this post.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://floodyberry.wordpress.com/2007/04/14/utf-8-conversion-tricks/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		
		<media:content url="https://0.gravatar.com/avatar/f0763395c4ecb8b1ff85e30f6d78323b7a7c85a2bfd2f6d6e218f947c55dc734?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">floodyberry</media:title>
		</media:content>
	</item>
		<item>
		<title>Breaking SuperFastHash</title>
		<link>https://floodyberry.wordpress.com/2007/03/29/breaking-superfasthash/</link>
					<comments>https://floodyberry.wordpress.com/2007/03/29/breaking-superfasthash/#respond</comments>
		
		<dc:creator><![CDATA[floodyberry]]></dc:creator>
		<pubDate>Thu, 29 Mar 2007 08:31:40 +0000</pubDate>
				<category><![CDATA[Hashing]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[hash]]></category>
		<category><![CDATA[security]]></category>
		<guid isPermaLink="false">http://floodyberry.wordpress.com/?p=7</guid>

					<description><![CDATA[After the problems SuperFastHash had in Hash Algorithm Attacks, I decided to try and break it completely, i.e. generate collisions algorithmically instead of brute forcing them. The attempt was more successful than I had anticipated, although Paul is obviously aware of the weak mixing in the final bits as evidenced by his comment in the [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>After the problems <a href="http://www.azillionmonkeys.com/qed/hash.html">SuperFastHash</a> had in <a href="http://www.team5150.com/~andrew/blog/2007/03/hash_algorithm_attacks.html">Hash Algorithm Attacks</a>, I decided to try and break it completely, i.e. generate collisions algorithmically instead of brute forcing them. The attempt was more successful than I had anticipated, although Paul is obviously aware of the weak mixing in the final bits as evidenced by his comment in the source code, &#8220;Force &#8216;avalanching&#8217; of final 127 bits&#8221;. My favorite collisions encountered would have to be &#8220;10/4 &lt; 3&#8221;, &#8220;10/5 = 2&#8221;, and &#8220;10/6 &gt; 1&#8221;, which have the property of hashing to the same value while being mathematically correct!</p>
<p>As I was writing this, I came up with a way to attack Bob Jenkins&#8217; <a href="http://www.burtleburtle.net/bob/hash/doobs.html">lookup3</a> as well. Unlike SuperFastHash, the lookup3 attack is due to the way the input bytes are being read in and does not indicate a deficiency in the mixing itself. If you are using lookup3 with hash tables, the core function will still be quite safe; it will only need to be modified if you are using it to generate unique 64bit identifiers <em>and</em> the input data could be altered for a malicious purpose.</p>
<p>With that said, let&#8217;s look at the attacks:</p>
<p><span id="more-7"></span></p>
<h4>SuperFastHash</h4>
<p>To begin with, I separated the innerloop in to its 4 steps:</p>
<pre class="brush: cpp; title: ; notranslate">1:	hash  += get16bits (data);
2:	tmp    = (get16bits (data+2) &lt;&lt; 11) ^ hash;
3:	hash   = (hash &lt;&lt; 16) ^ tmp;
4:	hash  += hash &gt;&gt; 11;</pre>
<p>and then ran a couple of known duplicate strings through and printed out the value at each step (input bytes are only consumed in steps 1 and 2):</p>
<blockquote class="fixed-width">
<div>
&#8220;aaaaaaaa&#8221; vs &#8220;aaadadaf&#8221;</p>
<p><strong>aa</strong> 1: 00006169 <strong>aa</strong> 2: 030b6969 3: 62626969 4: 626eb5b6   <strong>aa</strong> 1: 626f<strong>1717</strong> <strong>aa</strong> 2: <strong>61641f17</strong> 3: 76731f17 4: 7681ed7a</p>
<p><strong>aa</strong> 1: 00006169 <strong>ad</strong> 2: 03236969 3: 624a6969 4: 6256b2b6   <strong>ad</strong> 1: 6257<strong>1717</strong> <strong>af</strong> 2: <strong>61641f17</strong> 3: 76731f17 4: 7681ed7a</p>
<p>&#8220;ifkzihfe&#8221; vs &#8220;igdqhtfp&#8221;</p>
<p><strong>if</strong> 1: 00006671 <strong>kz</strong> 2: 03d33e71 3: 65a23e71 4: 65aef2b8   <strong>ih</strong> 1: 65af<strong>5b21</strong> <strong>fe</strong> 2: <strong>66846b21</strong> 3: 3da56b21 4: 3dad1fce</p>
<p><strong>ig</strong> 1: 00006771 <strong>dq</strong> 2: 038b4771 3: 64fa4771 4: 6506e6b9   <strong>ht</strong> 1: 6507<strong>5b21</strong> <strong>fp</strong> 2: <strong>66846b21</strong> 3: 3da56b21 4: 3dad1fce</p>
</div>
</blockquote>
<p>We can see that after round 2 step 1 (2:1) the lower 16 bits of <em>hash</em> are the same, and that after (2:2) all 32 bits of <em>tmp</em> are the same. Looking at step 3 reveals how this creates a collision: <em>hash = (hash &lt;&lt; 16) ^ tmp;</em>. The upper 16 bits of hash are thrown away, leaving the lower 16 bits (which are the same in the observed collisions) and 32 bits of tmp (which is the same in the observed collisions). Thus if after an initial 4 bytes we can find 2 bytes that create the same lower 16 bits in <em>hash</em> at (2:1), and then find the remaining 2 bytes that create the same 32 bits in <em>tmp</em> at (2:2), we have generated a colliding 8 byte string.</p>
<p>This can be further generalized to any string that&#8217;s a multiple of 8 bytes. If, from an initial hash value of X (the length of the string for SuperFastHash), you can find 8 bytes that collide with hash Y, then you simply use Y as the initial hash value and re-work the problem for the next 8 bytes, and so on. This works because SuperFastHash mixes the last 8 bytes it has hashed very poorly and does not apply its avalanching fix-up until it has reached the end of the input.</p>
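<p>The traces above can be reproduced with a standalone copy of the hash (transcribed from Paul&#8217;s published source; only the multiple-of-4 path is needed here, and the 16 bit reads are little endian, matching the traces):</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Little-endian 16-bit read, as in the traces above.
static uint32_t get16bits(const char *p) {
	uint16_t v;
	memcpy(&v, p, sizeof(v));
	return v;
}

// SuperFastHash, transcribed from the published source; the trailing-bytes
// switch from the original is omitted since our inputs are multiples of 4.
uint32_t SuperFastHash(const char *data, int len) {
	uint32_t hash = (uint32_t)len, tmp;
	for (int blocks = len >> 2; blocks > 0; --blocks) {
		hash += get16bits(data);                    // step 1
		tmp   = (get16bits(data + 2) << 11) ^ hash; // step 2
		hash  = (hash << 16) ^ tmp;                 // step 3
		hash += hash >> 11;                         // step 4
		data += 4;
	}
	// Force "avalanching" of final bits
	hash ^= hash << 3;  hash += hash >> 5;
	hash ^= hash << 4;  hash += hash >> 17;
	hash ^= hash << 25; hash += hash >> 6;
	return hash;
}
```

<p>Since all of the colliding pairs are 8 bytes long, they enter the avalanche stage with identical state, so the collision holds regardless of the avalanche constants.</p>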
<h5>Attack</h5>
<p>The attack was fairly straightforward once I identified what needed to be done. After a couple of revisions and refinements, I came up with:</p>
<ol>
<li>Take a source string S that has a length which is a multiple of 8 bytes</li>
<li>Hash S and for each 8 byte sequence, find the values of <em>hash</em> and <em>tmp</em> at steps (2:1) and (2:2)</li>
<li>Hash every permutation of the first 4 bytes of your attack string:
<ul>
<li>Find bytes 5 and 6 for each permutation which generate the lower 16 bits for the constant in (2:1).</li>
<li>Finally, find bytes 7 and 8 which generate the constant in (2:2). If this succeeds, either print the hit or process another 8 bytes until you reach the target length for your source string</li>
</ul>
</li>
</ol>
<p>This will probably not uncover every possible duplicate, especially for keys longer than 8 bytes, but it ran fast enough and generated enough duplicates that I did not need to refine it further.</p>
<h5>Results</h5>
<p>When using an 8 byte source string with all possible characters as input ( bytes 0x00 to 0xff ), <strong>~130,000,000</strong> colliding strings can be found in around 3 minutes. When you restrict the character set to printable characters (symbols, numbers, uppercase, and lowercase), anywhere from 10,000 to 200,000 (possibly higher) can be found in a few seconds.</p>
<p>Using a random initial value (keying the hash) reduces the collisions, but does not alleviate them entirely. For example, 108,600 strings were generated that hash to the same value as &#8220;zzzzzzzz&#8221;. When run through a hashtable insertion test with 131072 buckets using &#8220;0xdeadbeef + length&#8221; as the initial value instead of just &#8220;length&#8221;, there were still 3,374,795 compares due to full hash collisions, and the largest bucket had 376 links. By comparison, Bob Jenkins&#8217; lookup3 had 1 compare and 8 links in its largest bucket, and the x31 variant of the Torek/DJB hash had 0 compares and 6 links in its largest bucket. A small win is that increasing the key length does reduce the collisions with a keyed hash: 100,000 strings hashing to the same value as &#8220;zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz&#8221; resulted in only 500,000 compares with the largest bucket containing 158 links.</p>
<h4>lookup3</h4>
<p>The lookup3 attack is very simple and I actually feel dense for not having seen it until now. This is the basic 32bit little endian algorithm (32bit aligned reads) with input lengths assumed to be 24 or larger and a multiple of 12:</p>
<pre class="brush: cpp; title: ; notranslate">1:	a = b = c = ( 0xdeadbeef + len )

2:	for ( ; len &gt; 12; len -= 12, key += 12 ) {
		a += ( key[0] )
		b += ( key[4] )
		c += ( key[8] )
		mix( a, b, c )
	}

3:	c += ( key[8] )
	b += ( key[4] )
	a += ( key[0] )

4:	final( a, b, c )
	
	return ( c );</pre>
<p>mix() and final() mix every bit in a, b, and c very thoroughly, so trying to attack lookup3 algorithmically would be quite futile (for me). However, because every 12 bytes, or 96 bits, of input directly alters every bit of a, b, and c, there is a shortcut: </p>
<ol>
<li>Take any input string and run it through lookup3; record the values of a, b, and c after Step 3</li>
<li>Take your attack string and pad it to be a multiple of 12 bytes long, then add an additional 12 bytes</li>
<li>Run your attack string through lookup3 and stop after Step 2. We should now have the additional 12 bytes remaining to hash</li>
<li>Construct the final 12 bytes so that key[ 8 ] = ( target.c - c ), key[ 4 ] = ( target.b - b ), and key[ 0 ] = ( target.a - a ). The internal state will now match that of the target string before the call to final(), guaranteeing a collision</li>
</ol>
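<p>A sketch of the forgery (the rotation constants in mix() and final() follow my reading of lookup3, but the attack doesn&#8217;t depend on them, only on the last 12 bytes being added straight into a, b, and c):</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static uint32_t rot(uint32_t x, int k) { return (x << k) | (x >> (32 - k)); }

// mix/final modeled on lookup3; exact constants don't matter for the attack.
static void mix(uint32_t &a, uint32_t &b, uint32_t &c) {
	a -= c; a ^= rot(c, 4);  c += b;
	b -= a; b ^= rot(a, 6);  a += c;
	c -= b; c ^= rot(b, 8);  b += a;
	a -= c; a ^= rot(c, 16); c += b;
	b -= a; b ^= rot(a, 19); a += c;
	c -= b; c ^= rot(b, 4);  b += a;
}

static void final_mix(uint32_t &a, uint32_t &b, uint32_t &c) {
	c ^= b; c -= rot(b, 14);
	a ^= c; a -= rot(c, 11);
	b ^= a; b -= rot(a, 25);
	c ^= b; c -= rot(b, 16);
	a ^= c; a -= rot(c, 4);
	b ^= a; b -= rot(a, 14);
	c ^= b; c -= rot(b, 24);
}

static uint32_t word(const unsigned char *p) { uint32_t v; memcpy(&v, p, 4); return v; }

// Steps 1-4 from the post, for keys whose length is a multiple of 12.
static uint32_t hash12(const unsigned char *key, size_t len) {
	uint32_t a, b, c;
	a = b = c = 0xdeadbeef + (uint32_t)len;
	size_t n = len;
	for (; n > 12; n -= 12, key += 12) {
		a += word(key); b += word(key + 4); c += word(key + 8);
		mix(a, b, c);
	}
	c += word(key + 8); b += word(key + 4); a += word(key);
	final_mix(a, b, c);
	return c;
}

// Forge a 24-byte key colliding with 'target' (also 24 bytes, since the
// length seeds the initial state): run our chosen first 12 bytes through
// step 2, then pick the last three words as 2^32 modular differences
// against the target's state just after its step 3 additions.
static void forge(const unsigned char target[24], unsigned char out[24]) {
	uint32_t ta, tb, tc;
	ta = tb = tc = 0xdeadbeef + 24;
	ta += word(target); tb += word(target + 4); tc += word(target + 8);
	mix(ta, tb, tc);
	tc += word(target + 20); tb += word(target + 16); ta += word(target + 12);

	memcpy(out, "ATTACKATTACK", 12);     // arbitrary attacker-chosen prefix
	uint32_t a, b, c;
	a = b = c = 0xdeadbeef + 24;
	a += word(out); b += word(out + 4); c += word(out + 8);
	mix(a, b, c);

	uint32_t w0 = ta - a, w1 = tb - b, w2 = tc - c;   // wraparound intended
	memcpy(out + 12, &w0, 4); memcpy(out + 16, &w1, 4); memcpy(out + 20, &w2, 4);
}
```

<p>After the forged last 12 bytes are added in, both keys enter final() with identical state, so any final() produces the same hash for both.</p>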
<p>Now, this attack relies completely on knowing what a, b, and c are initialized to at the start of the hash. If you are using lookup3 for hash tables, you should already be initializing a, b, and c to random values to defeat bucket attacks, i.e. attacks searching for keys where the lower 15 to 20 bits match. Even the best algorithm is vulnerable to bucket attacks, so choosing a random initial value should be mandatory no matter what.</p>
<p>However, if you&#8217;re using lookup3 for something like a 64bit unique id or a file checksum, your initial state will need to be static and thus open to attack with this method. I&#8217;m not sure what you could do to get around this safely while retaining lookup3&#8217;s high speed; adding another 32 bits of state that wasn&#8217;t directly altered by the input might help, but where to stick it and how to mix it in isn&#8217;t something I could guess at. I&#8217;ve emailed Bob, but I don&#8217;t know if deliberate attacks like this are something he is concerned with. The lookup3 source does state <em>&#8220;Use for hash table lookup, or anything where one collision in 2^^32 is acceptable. Do NOT use for cryptographic purposes.&#8221;</em>, so we&#8217;ll see.</p>
<h4>Conclusions</h4>
<p>I already knew SuperFastHash had some peculiar results from my previous tests, and the outcome of this experiment drove the point home. While the offending collisions will not be common (what pathological input is?), the fact that they exist, were so readily obtained, and were still somewhat evident when changing the initial value of the hash suggests that it is probably best to start over from scratch. I should have noticed a problem sooner when SuperFastHash was running in to light collisions in <a href="http://www.team5150.com/~andrew/blog/2007/03/when_bad_hashing_means_good_caching.html">When bad hashing means good caching</a>, and that wasn&#8217;t even an intentional attack. </p>
<p>As far as lookup3, the trivial collisions are disturbing, but they are <em>only</em> a problem if an attacker can craft input for the hash function, and then <em>only</em> when you are using it to generate unique ids. It would be nice to see the issue addressed, though, as lookup3 is still unrivaled in terms of mixing and actually faster than SuperFastHash in some cases.</p>
<h4>Source</h4>
<p>You may download the <a href="http://www.team5150.com/~andrew/blog/files/superfasthash_attack.cpp">SuperFastHash attack source</a> and try it out for yourself. It comes preloaded with attacks against &#8220;********&#8221;, yielding 194 collisions, and &#8220;10/5 = 2&#8221;, yielding my favorite set of collisions. If you want to run it with all possible characters as input, make sure to turn the debug printf off or redirect output to a file; it can take a while to kill a process that has a lot of \x07 bells queued up.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://floodyberry.wordpress.com/2007/03/29/breaking-superfasthash/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		
		<media:content url="https://0.gravatar.com/avatar/f0763395c4ecb8b1ff85e30f6d78323b7a7c85a2bfd2f6d6e218f947c55dc734?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">floodyberry</media:title>
		</media:content>
	</item>
		<item>
		<title>When Bad Hashing Means Good Caching</title>
		<link>https://floodyberry.wordpress.com/2007/03/07/when-bad-hashing-means-good-caching/</link>
					<comments>https://floodyberry.wordpress.com/2007/03/07/when-bad-hashing-means-good-caching/#comments</comments>
		
		<dc:creator><![CDATA[floodyberry]]></dc:creator>
		<pubDate>Wed, 07 Mar 2007 18:54:18 +0000</pubDate>
				<category><![CDATA[Hashing]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[hash]]></category>
		<guid isPermaLink="false">http://floodyberry.wordpress.com/?p=9</guid>

					<description><![CDATA[I was testing various string hashing algorithms on chained hash tables, primarily to look at the bucket distribution and number of key comparisons with both prime and power of 2 sized tables. Each table node was set up to remember it&#8217;s full hash value so bucket collisions would only drop to a key comparison on [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>I was testing various string hashing algorithms on chained hash tables, primarily to look at the bucket distribution and number of key comparisons with both prime and power of 2 sized tables. Each table node was set up to remember its full hash value so bucket collisions would only drop to a key comparison on a true key collision. I initially wasn&#8217;t concerned with run times, but I tacked on a timer anyway so I could get a quick metric on how collisions and distribution were affecting performance and wound up running into a rather odd situation. <span id="more-9"></span> For the purposes of this article, the tables were also pre-sized before inserting keys to avoid resizing overhead.</p>
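<p>The node layout described above can be sketched roughly like this (the names are mine, not the test program&#8217;s):</p>

```cpp
#include <string>

// Each node caches the full 32-bit hash of its key, so walking a bucket
// only falls back to a (slow) string compare when the full hashes match.
struct Node {
    std::string  *key;
    int           value;
    Node         *next;   // chained collision list
    unsigned int  hash;   // full hash, not just the bucket index
    Node(std::string *k, int v, Node *n, unsigned int h)
        : key(k), value(v), next(n), hash(h) {}
};

// Push-front insert into a bucket: allocate the node, repoint the head.
void insert(Node **bucket, std::string *key, int value, unsigned int hash) {
    Node *node = new Node(key, value, *bucket, hash);
    *bucket = node;
}
```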
<p>While I was testing many algorithms, only two are needed for this article: The string hash from <a href="http://www.stlport.org/">STLPort</a> 4.6.2 and Paul Hsieh&#8217;s <a href="http://www.azillionmonkeys.com/qed/hash.html">SuperFastHash</a>. STLPort&#8217;s hash is simply &#8220;hash = ( hash * 5 ) + string[ i ]&#8221; which should be fairly fast but produce weak distribution on large amounts of clustered keys. SuperFastHash should also be quite fast while having much stronger hashing no matter the key size or clustering.</p>
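<p>For reference, the STLPort hash quoted above is just a multiply-and-add over each character, so keys that differ only in their last character hash to adjacent values:</p>

```cpp
#include <cstdint>
#include <string>

// The STLPort 4.6.2 string hash as quoted above: hash = hash * 5 + c.
// Adjacent keys produce adjacent hashes, which is what later causes the
// near-linear (and, it turns out, cache-friendly) bucket accesses.
uint32_t stlport_hash(const std::string &s) {
    uint32_t hash = 0;
    for (char c : s)
        hash = hash * 5 + static_cast<unsigned char>(c);
    return hash;
}
```

<p>e.g. <code>stlport_hash("AEIMQV") == stlport_hash("AEIMQU") + 1</code>.</p>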
<p>The dataset that caused the problem was a simple 6 character sequential permutation in the form of:</p>
<blockquote>
<div>[ABCDEFGH][EFGHIJKL][IJKLMNOP][MNOPQRST][QRSTUVWX][UVWXYZ01]</p>
<p>e.g.</p>
<p>AEIMQU<br />
AEIMQV<br />
AEIMQW<br />
AEIMQX<br />
AEIMQY<br />
AEIMQZ<br />
AEIMQ0<br />
AEIMQ1<br />
AEIMRU<br />
AEIMRV<br />
AEIMRW<br />
AEIMRX<br />
&#8230;.<br />
HLPTWZ<br />
HLPTW0<br />
HLPTW1<br />
HLPTXU<br />
HLPTXV<br />
HLPTXW<br />
HLPTXX<br />
HLPTXY<br />
HLPTXZ<br />
HLPTX0<br />
HLPTX1</p>
<p>(262144 unique keys in all)</p></div>
</blockquote>
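<p>For reproducibility, the full 262144-key dataset above can be generated with six nested loops, one per position alphabet (a sketch, not the original generator):</p>

```cpp
#include <string>
#include <vector>

// Generate all 8^6 = 262144 six-character keys, each position drawn
// from its own 8-character alphabet, in the same order as the listing.
std::vector<std::string> make_keys() {
    const char *alpha[6] = {
        "ABCDEFGH", "EFGHIJKL", "IJKLMNOP",
        "MNOPQRST", "QRSTUVWX", "UVWXYZ01"
    };
    std::vector<std::string> keys;
    keys.reserve(262144);
    for (int a = 0; a < 8; ++a)
    for (int b = 0; b < 8; ++b)
    for (int c = 0; c < 8; ++c)
    for (int d = 0; d < 8; ++d)
    for (int e = 0; e < 8; ++e)
    for (int f = 0; f < 8; ++f)
        keys.push_back({ alpha[0][a], alpha[1][b], alpha[2][c],
                         alpha[3][d], alpha[4][e], alpha[5][f] });
    return keys;
}
```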
<p>I was hoping to ferret out the weaker algorithms with this dataset and boy did I ever. Almost all the stats were what I expected: STLPort had ~10x bucket density for both prime/pow2 table sizes (i.e. an average of 10 extra links per filled bucket), SuperFastHash had 0.77x for prime and 0.64x for pow2; STLPort had 1.5 million key compares due to full key hashes colliding, SuperFastHash had 0.006 million (6000). Now this is where it got weird: The STLPort powered hash table ran <strong>~55%</strong> faster on my AMD64 3200+ than the SuperFastHash powered hash table. The results were similar on my AMD 1.1GHz TBird with the STLPort hash table running <strong>~36%</strong> faster than SuperFastHash. The raw difference on both CPUs was around the same, 90ms vs 140ms on the AMD64 and 190ms vs 260ms on the TBird. Something was definitely up.</p>
<p>Initial profiling was not very helpful. The STLPort hash algorithm ran faster than SuperFastHash, but the total difference was a few milliseconds at most even though SuperFastHash&#8217;s weakness is in smaller keys where it can&#8217;t take advantage of its streamlined inner loop. As for the Insert method of the hash table, the 1.5 million string compares took ~17ms which gave SuperFastHash a slight advantage when checking for duplicates, but even when I stripped out the duplicate checking code (all the keys were unique so it was safe to do so) STLPort still had a healthy advantage.</p>
<p>With the hash functions coming out equal and the duplicate checking code taken out, there really wasn&#8217;t much left to the method at all. The offending statements came down to</p>
<pre class="brush: cpp; title: ; notranslate">node = new Node( key, value, *head, hash );
*head = node;</pre>
<p>which simply allocates a new node with a simple constructor (key, value, next, hash) and points the head of the bucket to our new node. Since the hashtable is using &lt;string *, int&gt; for its key/value pair, this is just a couple of dword copies to initialize the node and update the bucket list, so how could this be slowing our code down so badly? I suspected the allocator might have something to do with it after noticing that running multiple tests in a single session resulted in increased runtime for each successive test (due to accumulated news/deletes?) for VS7 and to a lesser degree for gcc 3.4.4. I dropped in the <a href="http://www.hoard.org/">Hoard Memory Allocator</a> and while the code sped up a bit and multiple tests were possible in a single run with no degradation, STLPort still held the same lead (STLPort down to 55ms and SuperFastHash down to 105ms). I was just about at my wits&#8217; end when I decided to dump the actual bucket indices for each key to see where they were being placed:</p>
<blockquote>
<div>
<p>STLPort:<br />
273830<br />
273831<br />
273832<br />
273833<br />
273834<br />
273835<br />
273793<br />
273794<br />
273835<br />
273836</p>
<p>SuperFastHash:<br />
4267<br />
391151<br />
77466<br />
272914<br />
376435<br />
77856<br />
225677<br />
489782<br />
292888<br />
97819<br />
9443</p></div>
</blockquote>
<p>Whoops. STLPort&#8217;s bucket accesses were nearly linear and all clustered around the same area (only the middle 1/10th of the buckets were being used), while SuperFastHash&#8217;s accesses were (properly) random across the entire list. STLPort&#8217;s string comparisons were also linearly clustered due to being allocated and inserted in alphabetic order:</p>
<blockquote>
<div>CKKNTX &#8211; CJPNTX<br />
CKKNTX &#8211; CJOTW0<br />
CKKNTX &#8211; CJOSTX<br />
CKKNTY &#8211; CKJTW1<br />
CKKNTY &#8211; CKJSTY<br />
CKKNTY &#8211; CJPPR1<br />
CKKNTY &#8211; CJPOW1<br />
CKKNTY &#8211; CJPNTY<br />
CKKNTY &#8211; CJOTW1<br />
CKKNTY &#8211; CJOSTY<br />
CKKNTZ &#8211; CKJSUU<br />
CKKNTZ &#8211; CKJSTZ<br />
CKKNTZ &#8211; CJPNUU<br />
CKKNTZ &#8211; CJPNTZ<br />
CKKNTZ &#8211; CJOSUU<br />
CKKNTZ &#8211; CJOSTZ</div>
</blockquote>
<p>When the list was randomized and inserted again (under Hoard), SuperFastHash stayed at a cool 105ms while STLPort shot up to 460ms. STLPort gained ~360ms from the now non-linear string comparisons and ~40ms from the non-linear bucket accesses. Same number of compares and resulting bucket contents as with the linear dataset, but now with huge penalties for STLPort&#8217;s cache misses.</p>
<h4>Lessons learned</h4>
<ul>
<li>Test results can be meaningless if you don&#8217;t understand exactly what you&#8217;re testing.</li>
<li>Unless one of the algorithms being tested is explicitly taking advantage of cache lines, it is possible to get highly bogus results if you don&#8217;t generate your test data well. Before this I hadn&#8217;t even considered that caching could have this kind of effect on chained hashtables.</li>
<li>Hashing algorithms giving the <strong>fastest possible</strong> time for a particular dataset isn&#8217;t as important as giving the <strong>least worst</strong> time unless you are certain it will absolutely not be used for anything else. A flawed algorithm can even be dangerous as illustrated by Scott Crosby and Dan Wallach&#8217;s <a href="http://www.cs.rice.edu/~scrosby/hash/">Denial of Service via Algorithmic Complexity Attacks</a>.</li>
</ul>
<h4>Source Code</h4>
<p>You may download the <a href="http://www.team5150.com/~andrew/blog/files/hashdist.rar">C++ Source Code</a> demonstrating the problem and try it out for yourself. While the STLPort execution time should balloon when switching to the randomized dataset, it won&#8217;t always be faster than SuperFastHash on the linear dataset. On a 450MHz K6-3, STLPort is actually slower: 580/1500ms for the linear/random datasets versus 500/500ms for SuperFastHash.</p>
<p>Example output on my AMD64 3200+:</p>
<blockquote>
<div>This software uses the Hoard scalable memory allocator (version 3.5.1, libhoard).<br />
Copyright © 2005 Emery Berger, The University of Texas at Austin, and the University of Massachusetts Amherst.<br />
For more information, see <a href="http://www.hoard.org" rel="nofollow">http://www.hoard.org</a><br />
starting..<br />
elapsed for STLPort-4.6.2/string-6-linear PRIME 63, comps 1581380, sdev 10.345448, active bins 27357, items 262144</p>
<p>elapsed for STLPort-4.6.2/string-6-linear POW2 63, comps 1581380, sdev 10.345448, active bins 27357, items 262144<br />
elapsed for STLPort-4.6.2/string-6-random PRIME 484, comps 1581380, sdev 10.345448, active bins 27357, items 262144<br />
elapsed for STLPort-4.6.2/string-6-random POW2 485, comps 1581380, sdev 10.345448, active bins 27357, items 262144</p>
<p>elapsed for Hsieh/string-6-linear PRIME 94, comps 6145, sdev 0.775544, active bins 188020, items 262144<br />
elapsed for Hsieh/string-6-linear POW2 94, comps 6145, sdev 0.639169, active bins 202819, items 262144<br />
elapsed for Hsieh/string-6-random PRIME 109, comps 6145, sdev 0.775544, active bins 188020, items 262144<br />
elapsed for Hsieh/string-6-random POW2 110, comps 6145, sdev 0.639169, active bins 202819, items 262144</p></div>
</blockquote>
]]></content:encoded>
					
					<wfw:commentRss>https://floodyberry.wordpress.com/2007/03/07/when-bad-hashing-means-good-caching/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
		
		<media:content url="https://0.gravatar.com/avatar/f0763395c4ecb8b1ff85e30f6d78323b7a7c85a2bfd2f6d6e218f947c55dc734?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">floodyberry</media:title>
		</media:content>
	</item>
		<item>
		<title>Why Blizzard Loves Diablo II Cheats</title>
		<link>https://floodyberry.wordpress.com/2006/10/06/why-blizzard-loves-diablo-ii-cheats/</link>
					<comments>https://floodyberry.wordpress.com/2006/10/06/why-blizzard-loves-diablo-ii-cheats/#respond</comments>
		
		<dc:creator><![CDATA[floodyberry]]></dc:creator>
		<pubDate>Fri, 06 Oct 2006 09:25:27 +0000</pubDate>
				<category><![CDATA[Diablo]]></category>
		<category><![CDATA[Games]]></category>
		<category><![CDATA[blizzard]]></category>
		<category><![CDATA[cheats]]></category>
		<category><![CDATA[diablo]]></category>
		<guid isPermaLink="false">http://floodyberry.wordpress.com/?p=5</guid>

					<description><![CDATA[Blizzard loves cheats? Are you sure? What about all their anti-cheat measures, like Rust Storm, Warden, and the mass bans we always hear about? Surely they wouldn&#8217;t fight something they are in favor of. Or would they? Let&#8217;s take a deeper look in to Diablo II and see just who is profiting from the use [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Blizzard loves cheats? Are you sure? What about all their anti-cheat measures, like <a href="http://www.insidemacgames.com/news/story.php?ArticleID=8632">Rust Storm</a>, <a href="http://en.wikipedia.org/wiki/The_Warden_(software)">Warden</a>, and the mass bans we always hear about? Surely they wouldn&#8217;t fight something they are in favor of. Or would they? Let&#8217;s take a deeper look into Diablo II and see just who is profiting from the use and abuse of cheating.</p>
<p><span id="more-5"></span></p>
<h4>Types of Cheating</h4>
<p>What is meant by a cheat anyway? Do you mean .exe hacks, in-game exploits, trade exploits, activities that affect the game economy, what? Does Blizzard love them all? There are actually quite a few different kinds of activities which can be classified as cheats, each with differing levels of severity. Some, while not being an intended activity of the designers, really don&#8217;t affect the game, while others cripple it to the point of being unplayable. We&#8217;ll start with the most benign and work our way up from there.</p>
<h5>Class A cheats (No real harm to the game)</h5>
<ul>
<li><strong>GUI </strong>modifications: Modifying item colors, highlighting items and monsters on the automap.</li>
<li><strong>Pickit hacks</strong>: Allowing your character to automatically pick up a user-defined list of items should they drop.</li>
<li><strong>Map Hacks</strong>: Automatically reveals the entire map.</li>
<li><strong>D2Loader</strong>: A no-CD loader for Diablo II which also allows you to run more than one client at a time.</li>
</ul>
<p>Class A cheats have no real effect on the game outside of providing some automation for tasks a person could do just as well manually or providing some harmless bugs. Map hacks may at first appear to be a clear advantage, but the 10-15 seconds you save searching for an exit or portal are really only useful to a new player. The nature of Diablo II acquaints any player with the map layouts very quickly.</p>
<h5>Class B cheats (Mild harm, more annoying than having any real effect on the game or economy)</h5>
<ul>
<li><strong>Glitch Rush bug</strong>: Normally you may not advance to the next difficulty (from Normal to Nightmare, or Nightmare to Hell) until your character level is sufficiently high. This means you must take the time to level up before you can advance, which can be fairly tedious. If a character who cannot advance stands in Act 5 while a player who is high enough to advance, but hasn&#8217;t, defeats Baal, your low level character will advance as well. This is how players get level 1 characters into Hell, which also lets you collect all 3 Hellforges for that player fairly quickly. Until higher runes are widely duped, this is a fast way to get Um, Mal, Ist, and Gul runes. An added detriment to this bug is that once you advance to the next difficulty, you can not see games made from the previous difficulty, e.g. a player in Nightmare difficulty who defeats Baal and advances to Hell will no longer be able to see Nightmare games. Players using the glitch rush bug will thus trick unknowing players into defeating Baal so their lower-level glitch-rushed characters can advance. The tricked players will then not be able to join Nightmare games to level in while at the same time being far too weak to participate in Hell games. This is seen by the bug abusers as <a href="http://forums.diabloii.net/showpost.php?p=3706575&amp;postcount=7">humorous</a>.</li>
<li><strong>Drop trade hacks</strong>: Players are only allowed to carry a single <a href="http://www.battle.net/diablo2exp/items/normal/ucharms.shtml">Gheed&#8217;s Fortune</a>, Hellfire Torch, and <a href="http://www.battle.net/diablo2exp/items/normal/ucharms.shtml">Annihilus</a> at a time, e.g. if you already possess an <a href="http://www.battle.net/diablo2exp/items/normal/ucharms.shtml">Annihilus</a>, you may not pick up another one and get twice the benefits. The problem is that these three items were not allowed to be placed in the trade screen, thus you were forced to drop them if you wanted to trade them to another player. Normally two players would stand fairly far apart, drop their items, then run to the other player&#8217;s position to collect their bounty; you had to take it on faith that the player would drop the proper item. As if this wasn&#8217;t bad enough, there were certain ways to cause other players to disconnect/crash from the game, which is where the hack comes in. Scammers would set up the trade, both players would drop their items, and the scammer would cause the other player to disconnect and then collect both valuable items, risk free. To Blizzard&#8217;s &#8220;credit&#8221;, they fixed this in June 2006 (3 years after Annihilus and Gheed&#8217;s Fortune were introduced) by allowing all items to be placed in the trade screen.</li>
</ul>
<p>Class B cheats can give you temporary advantages, but in the long run really don&#8217;t amount to much. Glitch rushing has no added benefit outside of slightly faster Hellforge quests and drop trade hacks are (finally) patched. Glitch rushing should be fixed, but is not particularly damaging.</p>
<h5>Class C cheats (Severe harm to the game and economy)</h5>
<ul>
<li><strong>Item Bots</strong>: These are bots that are automated to kill major bosses repeatedly while collecting any good items that drop in the process. While they won&#8217;t render the game economy useless by flooding it with thousands of high end items, they can still have a marked effect, especially in regards to mid-level items. The users of these bots are <a href="http://www.blizzsector.net/mm-bot/28174-items-found-mm-bot.html">quite proud</a> of the items they &#8220;find&#8221;. One blizzsector.net forum member <a href="http://www.blizzsector.net/mm-bot/28174-items-found-mm-bot.html#post315824">proclaims</a> (<a href="http://www.team5150.com/~andrew/blog/images/d2-mmbot-8hours.png">alt</a>) &#8220;I bot for about 8 hours a day (sleeping) now, but sometimes I run out of rejuv potions and am too lazy to run around and find them, so I just don&#8217;t bother running it at night.&#8221; Examples of such bots are <a rel="nofollow" href="http://www.d2jsp.org/">d2jsp</a>, <a rel="nofollow" href="http://www.gamemunchers.net/">EasyPlay</a> (now defunct), and <a rel="nofollow" href="http://www.mmbot.net/">mmBot</a>. Of the three, d2jsp and EasyPlay are somewhat protected against by Blizzard, while mmBot has remained undetected by Blizzard and is widely used.</li>
<li><strong>PvP Bots</strong>: These include auto-aim bots and TPPK (Town Portal Player Kill) bots. TPPK bots are like auto-aim bots, except they are used to kill players who are not dueling you. They work by firing a bolt weapon/spell at a player you would like to kill, quickly portaling to town, and enabling hostile mode on that player. This is especially damaging when you are playing in hardcore mode, where a single death means the end of your character. If you are not careful about who you play with, your weeks or months of hard work on a character can be gone in an instant.</li>
</ul>
<p>Class C cheats are where the quality of the game starts to deteriorate. The economy is impacted by item bots, the quality of PvP is lessened by aimbots, and the already high risk of hardcore mode is now heightened by the very community you participate in.</p>
<h5>Class D cheats (Render the game unplayable and the economy a sham)</h5>
<ul>
<li><strong>Duping</strong>: Duping is when a player duplicates an item using an exploit. <a href="http://www.diabloii.net/columnists/a-items-origin.shtml">The Origin of Bugged Items</a> is a very good article on the history of duping and bugged items. Duping affects the game in a myriad of ways. The first is that the rarity of any item is completely controlled by the duper. If the government allowed private citizens to both create and spend counterfeit money, money would lose all meaning. This is the exact situation Diablo II is in, and has been in, for years. The rarest items in the game can be found in abundance at any trading forum. Players are decked out in equipment that would literally have taken years to find if they had actually played the game instead of resorting to bots and duped items. The trading elite are not the ones who invest in the game or make shrewd trades, they are the players backed by dupers with virtually unlimited buying and selling potential. Counterfeit items running rampant are not the only problem. As mentioned in <a href="http://www.diabloii.net/columnists/a-items-origin.shtml">The Origin of Bugged Items</a>, <strong>many people make a lot of money creating and selling these counterfeits</strong>. Online item stores such as <a rel="nofollow" href="http://d2legit.com/">d2legit</a>, <a rel="nofollow" href="http://www.jpitems.com/">jpitems</a>, and <a rel="nofollow" href="http://www.enzod2.com/">enzod2</a> are pathetically easy to find, whether by a web search or by getting <a href="http://www.team5150.com/~andrew/blog/images/d2-spam-one.png">spammed in-game</a>. eBay is crawling with snakes selling their counterfeit wares; you can even find <a href="http://www.team5150.com/~andrew/blog/images/d2-scum.png">enterprising individuals on Blizzard&#8217;s forums</a> offering up items for cash. 
One particularly inventive shyster has even found a way to combine two of his favorite hobbies: exploiting dupes and defrauding Google&#8217;s AdSense program by <a href="http://d2xp.com/forum2/showthread.php?t=9642">bribing players</a> <a href="http://d2xp.com/forum2/showthread.php?t=9747">to click</a> <a href="http://d2xp.com/forum2/showthread.php?t=9374">his ads</a> (<a href="http://www.team5150.com/~andrew/blog/images/d2-shyster.png">alt1</a> <a href="http://www.team5150.com/~andrew/blog/images/d2-shyster-two.png">alt2</a>). Let&#8217;s hope the <a href="http://www.google.com/support/adsense/bin/answer.py?answer=18386">AdSense abuse team</a> needs Diablo II items more than they need his fraudulent clicks!
<p>The more dishonest merchants (such as d2legit) go so far as to claim &#8220;&#8230;unless otherwise noted, all items on the site, including runes and SoJs, are legit.&#8221; This statement, while it may be true in regards to the extremely common items in stock, is a blatant lie for any exceedingly rare item and especially for any high rune. In their defense, I know of nobody playing Diablo II who would be fooled by their disclaimer, so it is probably meant to sway undecided small time players and not to be taken literally. The hardcore players who buy items from these stores know and accept that the majority of the high end items are dupes.</p>
<p>The counterfeit market additionally creates an atmosphere where you either use the counterfeits yourself or fall painfully behind the rest of the players in the game. When every player except you is using the most powerful items in the game, you either cheat to compete or give up. Furthermore, many players don&#8217;t have the time to gather items to trade up for counterfeits or the knowledge of how to use cheats and bots, and are forced to spend actual money buying items that a duper can clone to his heart&#8217;s content. The situation is even worse than printing money in your basement as the sale of dupes is legal.</li>
</ul>
<p>Now you can claim that the players are not forced to play the game, to cheat, to spend money on counterfeit items, and that they find real enjoyment in participating in the community, and you are most likely right (this is of course ignoring the players who do not have wads of time to play, refuse to cheat, and refuse to pay money for counterfeit goods). However, the agent that is perpetuating this fraud and benefiting the most is not the scammers or the players who support them; it is Blizzard. After I outline what Blizzard has done to counter the problems in the game and the effects of their efforts, you will hopefully see why I say this.</p>
<h4>What has Blizzard Done?</h4>
<h5>Patch 1.06 &#8211; Dupe Scan</h5>
<p>On April 19th, 2001, Blizzard released <a href="http://www.blizzard.com/support/?id=mdt0454p">Version 1.06</a> of Diablo II. This patch featured a new dupe scan with the following description.</p>
<blockquote>
<div>The Diablo II Realms now scan characters for duplicate items. If a player is found to have more than one of the same item, the duplicate items will be deleted leaving just one of that item.</div>
</blockquote>
<p>Their FAQ further clarifies their altruistic reasoning for leaving a single item behind.</p>
<blockquote>
<div>Q: Why did Blizzard Entertainment only delete some of the dupes, and not every dupe?</p>
<p>A: Blizzard Entertainment wants to maintain an enjoyable and balanced play experience for every user. To that end, we removed all but one duplicate item. We&#8217;re making an effort to protect those players that legitimately traded for those items.</p></div>
</blockquote>
<p>Ahh, how thoughtful! They&#8217;re going to punish the cheaters by removing all but a single dupe from a (possibly legitimate) player&#8217;s inventory! Wait, what? How is this going to deter cheating? If anything, the players who are unable to make dupes and are forced to rely on the dupes as currency will now be penalized for it, while the dupers will enjoy a nice price gouging as the items they print off in their basement become scarce.</p>
<h5>Patch 1.08 &#8211; Dupe Methods Blocked</h5>
<p>On June 19, 2001, Blizzard released <a href="http://www.blizzard.com/support/?id=mdt0451p">Version 1.08</a> of Diablo II. This patch claimed to block all known duping methods and to continue the dupe deletion started in 1.06.</p>
<blockquote>
<div>Q: What is Blizzard&#8217;s policy on item duping?</p>
<p>A: We believe that item duping undermines the basic rules of fair play and detracts from the spirit of true competition. Furthermore, we have discouraged item duping by blocking all known duping exploitations and have removed duped items from our servers. We shall continue to monitor and stop any attempts at item duplication.</p></div>
</blockquote>
<p>Despite their claims, duping was not blocked and their generous offer to purge items from legitimate players still stood; just how the dupes were able to survive long enough to propagate into the community and be used by players will be covered in the next section.</p>
<h5>Patch 1.10 &#8211; Mass Bans, The Ladder, Annihilus, Rune Words, and Poofing</h5>
<h6>Mass Bans</h6>
<p>The first of the Diablo II mass bans took place right before <a href="http://www.blizzard.com/support/?id=mdt0763p">Version 1.10</a>. On June 10, 2003, Blizzard banned <a href="http://www.blizzard.com/SUPPORT/?id=nNews095p">112,000 accounts</a> in keeping with their &#8220;aggressive stance against cheating&#8221;. This will turn out to be the first major tipoff on just how much Blizzard loves cheaters. You don&#8217;t ban 112,000 accounts without either wanting to make a strong statement and risking a community and market backlash, or attempting to lure back addicted, and now banned, players who cheat and prosper because of your &#8220;aggressive stance against cheating&#8221;.</p>
<p>You see, the cheaters typically have many CD-Keys and run multiple clients at once, either to run multiple bots or to keep games open so they can transfer their items from character to character (especially to characters with clean CD-Keys that have never been flagged for cheating). This obviously means they are always <a href="http://www.battle.net/forums/thread.aspx?fn=d2-uswest-ladder-trading&amp;t=2570428">acquiring new CD-Keys</a> (<a href="http://www.team5150.com/~andrew/blog/images/d2-cdkeys.png">alt</a>) and will re-buy the game if necessary to continue playing. It is common to read about players on forums who have been banned for cheating, yet just buy a new CD-Key and start from scratch. Cheaters getting banned and still attempting to play because their cheats still work? &#8220;Backlash&#8221; is probably not the first thought that springs to mind.</p>
<h6>The Ladder</h6>
<p>A couple months after the first mass ban, <a href="http://www.blizzard.com/support/?id=mdt0763p">Version 1.10</a> was released. A lot of new content was added, most of which is very interesting when you consider it from a cheater&#8217;s perspective. One of the biggest new features was the introduction of the Ladder.</p>
<blockquote>
<div>Introduced a new Ladder System for those who <strong>prefer to play free from any characters who may have participated in past item duping or hacking</strong>. To use this new feature, a player places a check in the Ladder box upon character creation &#8212; in the same way that Expansion and Hardcore selection is done. Periodically, the Ladder may be reset, adding the old Ladder characters to the regular population &#8212; who, of course, cannot play games with the new season&#8217;s Ladder characters. Thus, every season each new Ladder character truly starts from scratch, as no &#8216;twinking&#8217; is possible from older characters.</div>
</blockquote>
<p>The ladder also introduced many new and powerful Rune Words that were not available in non-ladder games. We will discuss Rune Words later, but for the moment just know that the new Rune Words were a very strong enticement to play on the Ladder, although perhaps not strong enough if you do not take the inevitability of cheating into account. The clean slate economy from each ladder reset would also be enticing to any player, but while it may have looked like a great chance for some clean fun to legit players, the low supply and enormous demand would prove to be irresistible to the cheaters and dupers Blizzard claimed to despise so vehemently.</p>
<h6>Annihilus</h6>
<p>The new patch also introduced many new items. The most interesting item is the <a href="http://www.battle.net/diablo2exp/items/normal/ucharms.shtml">Annihilus</a> charm. You see, by this time there were millions of duped <a href="http://www.planetdiablo.com/library/item-stoneofjordan.htm">Stone of Jordans</a> floating around, commonly referred to as a &#8220;soj&#8221;. Even though sojs are fairly rare when playing normally, they were duped massively early in Diablo II and became the main currency of the game. The designers, possibly in an attempt to clean up the world of duped sojs as their other methods weren&#8217;t working, created an <a href="http://www.battle.net/diablo2exp/monsters/act5-uberdiablo.shtml">Uber Diablo</a> monster who only spawns when 100 sojs have been sold to NPC (Non Player Character) stores in a server. When you kill <a href="http://www.battle.net/diablo2exp/monsters/act5-uberdiablo.shtml">Uber Diablo</a>, he drops a single <a href="http://www.battle.net/diablo2exp/items/normal/ucharms.shtml">Annihilus</a> (commonly referred to as anni) charm, which is immensely powerful.</p>
<p>Oh, did I forget to mention that Blizzard runs multiple servers per single machine, and that Uber Diablo will spawn on all of them, even if you sell the sojs in a non-ladder game and host a ladder game, and that the ladder economy is much less infested with dupes compared to the non-ladder economy? Oh yeah, the players with the duped sojs will also create co-ops where multiple people each contribute 10-15 sojs, decide on a server to sell sojs on, all create multiple games with D2Loader and their plethora of CD-Keys, then collect their legitimate and very valuable and powerful <a href="http://www.battle.net/diablo2exp/items/normal/ucharms.shtml">Annihilus</a> charms for a fraction of the 100 soj cost.</p>
<p>This may be setting off yet another alarm in your mind in regards to Blizzard&#8217;s &#8220;aggressive stance against cheating&#8221;. Here we have the game designers intentionally creating a way to turn duped items into very valuable new items which legitimate players <strong>can never find legitimately</strong>. You must be playing in a heavily duped environment such as non-ladder and possess a great many sojs to ever intentionally spawn Uber Diablo. A legitimate player will never find 100 sojs of his own outside of cheating with bots or buying dupes from item stores and other players, although even a bot would be hard pressed to find 100 sojs in a timely manner (rough estimates of 70 hours of playing per soj found = 290 days of straight playing to hit 100 sojs; two friends and I found 2 sojs in 3-4 months of playing). In case you were wondering, there are still plenty of duped sojs floating around as Blizzard has never fixed duping.</p>
<h6>Runes and Rune Words</h6>
<p>Since we&#8217;re on the subject of rewarding dupers and cheaters, let&#8217;s move on to runes and the new Rune Words in 1.10. The <a href="http://www.battle.net/diablo2exp/items/runes.shtml">description of a rune</a> from The Arreat Summit, Blizzard&#8217;s guide on Diablo II, is as follows:</p>
<blockquote>
<div>Runes are small stones inscribed with magical glyphs that can be inserted into Socketed Items. Runes are different from other Insertable Items: not only do individual Runes have set magical properties, <strong>certain combinations (or Rune Words), when inserted into an item in the proper order, give that item even more wondrous abilities.</strong></div>
</blockquote>
<p>There are 33 runes, each more scarce than the one before it. To give you an idea of how scarce they become, here is a table with the odds of each rune dropping per monster killed. Note: The Countess is a boss monster who drops lower runes with a much higher frequency than most other monsters. All values are approximate and vary per monster.</p>
<pre class="brush: cpp; title: ; notranslate">Rune (Rank)   Countess     Super Boss    Normal Monster
-----------   --------   ------------    --------------
  El (  1 )        1/2          1/150           1/3,400
 Eld (  2 )        1/3          1/200           1/5,000
 Tir (  3 )        1/4          1/300           1/6,200
 Nef (  4 )        1/4          1/450           1/9,200
 Eth (  5 )        1/5          1/430           1/8,800
 Ith (  6 )        1/6          1/600          1/13,000
 Tal (  7 )        1/6          1/530          1/10,000
 Ral (  8 )        1/8          1/700          1/15,000
 Ort (  9 )        1/9          1/750          1/15,000
Thul ( 10 )       1/13        1/1,100          1/22,000
 Amn ( 11 )       1/14        1/1,300          1/24,800
 Sol ( 12 )       1/20        1/1,500          1/12,000
Shael( 13 )       1/27        1/2,600          1/47,000
 Dol ( 14 )       1/41        1/3,500          1/70,000
 Hel ( 15 )       1/53        1/5,000          1/91,000
  Io ( 16 )       1/80        1/6,800         1/130,000
 Lum ( 17 )      1/100        1/9,000         1/180,000
  Ko ( 18 )      1/160       1/13,000         1/270,000
 Fal ( 19 )      1/200       1/17,000         1/350,000
 Lem ( 20 )      1/300       1/28,000         1/530,000
 Pul ( 21 )      1/423       1/35,000         1/715,000
  Um ( 22 )      1/635       1/53,000       1/1,000,000
 Mal ( 23 )      1/739       1/60,000       1/1,200,000
 Ist ( 24 )    1/1,100       1/90,000       1/1,800,000
 Gul ( 25 )  1/120,000      1/100,000       1/2,100,000
 Vex ( 26 )  1/185,000      1/160,000       1/3,200,000
 Ohm ( 27 )  1/210,000      1/200,000       1/3,800,000
  Lo ( 28 )  1/320,000      1/260,000       1/5,000,000
 Sur ( 29 )         NA      1/350,000       1/6,500,000
 Ber ( 30 )         NA      1/500,000      1/10,000,000
 Jah ( 31 )         NA      1/600,000      1/11,000,000
Cham ( 32 )         NA      1/800,000      1/17,000,000
 Zod ( 33 )         NA    1/3,000,000      1/60,000,000</pre>
<p>To put these drop odds in perspective, we will need to figure out how many monsters you can kill on average. Blizzard limits you to joining around 20 games an hour (this is meant to combat item bots, although it often hinders legitimate players as well), giving you about 3 minutes per game. There are around 10-15 &#8220;Super Bosses&#8221; you can kill to give you a chance at the higher runes. Assuming you can get 5 in 2 minutes (highly unlikely without a bot unless you target weaker super bosses with much worse drop odds), that would leave about a minute for average monsters, for which I&#8217;ll generously claim 50 kills. These numbers would give you 5*20 = 100 Super Bosses an hour and 50*20 = 1,000 regular monsters an hour. Plugging these numbers in for the high runes, we get:</p>
<pre class="brush: cpp; title: ; notranslate">Rune (Rank)     Hours Required To Find
-----------   ------------------------
 Gul ( 25 )      693 hours or  28 days
 Vex ( 26 )    1,086 hours or  45 days
 Ohm ( 27 )    1,318 hours or  54 days
  Lo ( 28 )    1,753 hours or  73 days
 Sur ( 29 )    2,275 hours or  94 days
 Ber ( 30 )    3,333 hours or 138 days
 Jah ( 31 )    3,882 hours or 161 days
Cham ( 32 )    5,440 hours or 226 days
 Zod ( 33 )   20,000 hours or 833 days</pre>
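<p>If you want to check these figures yourself, the estimate above is just two kill streams added together: expected drops per hour from super bosses plus expected drops per hour from normal monsters, with the expected wait being the reciprocal of that rate. Here is a minimal Python sketch; the odds come from the rune table above, and the recomputed values differ by a few percent from the published ones because the listed odds are rounded.</p>
<pre class="brush: python; title: ; notranslate">
# Expected hours to find each high rune, given the kill rates assumed
# above: 100 super bosses and 1,000 normal monsters per hour.
SUPER_BOSSES_PER_HOUR = 100
NORMAL_KILLS_PER_HOUR = 1_000

# (rune, 1-in-N odds per super boss kill, 1-in-N odds per normal kill)
HIGH_RUNES = [
    ("Gul",    100_000,  2_100_000),
    ("Vex",    160_000,  3_200_000),
    ("Ohm",    200_000,  3_800_000),
    ("Lo",     260_000,  5_000_000),
    ("Sur",    350_000,  6_500_000),
    ("Ber",    500_000, 10_000_000),
    ("Jah",    600_000, 11_000_000),
    ("Cham",   800_000, 17_000_000),
    ("Zod",  3_000_000, 60_000_000),
]

def hours_to_find(boss_odds, normal_odds):
    # Expected drops per hour from the two kill streams; the expected
    # wait is the reciprocal of that combined rate.
    drops_per_hour = (SUPER_BOSSES_PER_HOUR / boss_odds
                      + NORMAL_KILLS_PER_HOUR / normal_odds)
    return 1.0 / drops_per_hour

for rune, boss, normal in HIGH_RUNES:
    h = hours_to_find(boss, normal)
    print(f"{rune:&gt;4}: {h:8,.0f} hours or {h / 24:3,.0f} days")
</pre>
<p>The Zod case works out exactly: 100/3,000,000 + 1,000/60,000,000 drops per hour = 1/20,000, or 20,000 hours.</p>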
<p>Now runes in and of themselves are not that useful. A few of them have nice magical properties, but on the whole you almost never use a rune on its own. Instead, you combine them into powerful <a href="http://www.battle.net/diablo2exp/items/runewords.shtml">Rune Words</a>. A Rune Word is a set of runes placed in a socketed item in a set order; think of it as a recipe for a powerful item with the runes being the ingredients. The definition of a Rune Word from The Arreat Summit is:</p>
<blockquote>
<div>If the player puts certain combinations of Runes in the correct order into an item with exactly that number of sockets and of the correct item type, the item&#8217;s name will change into a &#8220;unique&#8221; name, displayed in gold, and the item will acquire extra powers, depending on the &#8220;rune word&#8221; that was used.</div>
</blockquote>
<p>Let&#8217;s take a look at the cost of some of the new Rune Words from the 1.10 patch. To simplify, we will assume that finding a single rune also provides one of each rune below it, so finding a Zod will give you one of each lesser rune to work with. This does not exactly hold up in the real game due to varied drop odds for different monsters.</p>
<pre class="brush: cpp; title: ; notranslate">Runeword                                  Days
-------------------------------------------------------     -------------
Breath of the Dying:   Vex + Hel + El + Eld + Zod + Eth     833(Zod) days
Call To Arms:          Amn + Ral + Mal + Ist + Ohm          54(Ohm) days
Chains of Honor:       Dol + Um + Ber + Ist                 138(Ber) days
Doom:                  Hel + Ohm + Um + Lo + Cham           226(Cham) days
Enigma:                Jah + Ith + Ber                      161(Jah) days
Exile:                 Vex + Ohm + Ist + Dol                54(Ohm) days
Heart of the Oak:      Ko + Vex + Pul + Thul                 45(Vex) days
Faith:                 Ohm + Jah + Lem + Eld                161(Jah) days
Grief:                 Eth + Tir + Lo + Mal + Ral           73(Lo) days
Infinity:              Ber + Mal + Ber + Ist                276(Ber*2) days
Last Wish:             Jah + Mal + Jah + Sur + Jah + Ber    483(Jah*3) days
Phoenix:               Vex + Vex + Lo + Jah                 161(Jah) days</pre>
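<p>The day counts in the right-hand column all follow one simple rule: the rarest rune in the recipe dominates the cost, and each extra copy of it requires another full search. A short Python sketch of that rule, using the days-to-find figures from the previous table (runes cheaper than Gul are treated as free, per the simplifying assumption above):</p>
<pre class="brush: python; title: ; notranslate">
from collections import Counter

# Days of non-stop play to find each high rune, from the table above.
# Runes below Gul are common enough to treat as free here.
DAYS = {"Gul": 28, "Vex": 45, "Ohm": 54, "Lo": 73, "Sur": 94,
        "Ber": 138, "Jah": 161, "Cham": 226, "Zod": 833}

def runeword_days(runes):
    """Estimated days to assemble a Rune Word: days for the rarest
    rune, multiplied by how many copies of it the recipe needs."""
    high = [r for r in runes if r in DAYS]
    if not high:
        return 0
    rarest = max(high, key=DAYS.get)
    return DAYS[rarest] * Counter(high)[rarest]

print(runeword_days(["Jah", "Mal", "Jah", "Sur", "Jah", "Ber"]))  # Last Wish: 483
print(runeword_days(["Ber", "Mal", "Ber", "Ist"]))                # Infinity: 276
</pre>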
<p>It would take a legit player <strong>2.2 years</strong> of <strong>non-stop</strong> playing before he/she found a Zod rune to complete Breath of the Dying. Last Wish would take a paltry <strong>1.32 years</strong>. Infinity clocks in at <strong>0.75 years</strong>. It should be further pointed out that once a rune is used in an item, it is <strong>gone for good</strong>. There is no way to extract a rune from an item, so if you mess up a recipe or want to create another Rune Word, you will need to find another copy of each rune. The very act of using a rune implicitly drives the demand for that rune higher. <strong>Every rune and runeword I listed is available virtually without limit from item stores and traders on Blizzard&#8217;s official forums and has been since around one month after the last ladder/economy reset.</strong></p>
<p>To get an idea of how often the higher runes are legitimately found, in about 3-4 months of playing, my 2 friends and I found a total of 1 Mal, 2 Ist, 3 Vex, and 1 Ohm. Meanwhile the market had been so flooded with dupes that high runes were readily available for most of the time we were playing, even though the ladder had just been reset and everybody was starting from zero. Now, a year later, traders on forums will make deals involving <a href="http://www.battle.net/forums/thread.aspx?fn=d2-uswest-ladder-trading&amp;t=2554430&amp;tmp=1#post2554430">hundreds of high runes</a> (<a href="http://www.team5150.com/~andrew/blog/images/d2-hundreds.png">alt</a>) at a time. 2.2 years of non-stop playing time to find a Zod, and you can purchase <a rel="nofollow" href="http://www.d2legit.com/products.php?cpath=4_6_712&amp;rf=1159900540">40 &#8220;legit&#8221; Zods</a> (<a href="http://www.team5150.com/~andrew/blog/images/d2-rune-packages.png">alt</a>) for $24 at d2legit.com. Many thousands more are implicitly available in pre-made runewords made with the highest grade items. Please remind me again how an &#8220;aggressive stance against cheating&#8221; results in Rune Words blatantly made for cheaters, online stores selling thousands of counterfeit goods, and the absolute destruction of the game&#8217;s economy a month after each ladder reset?</p>
<p>As one player on Blizzard&#8217;s forums <a href="http://www.battle.net/forums/thread.aspx?FN=d2-general&amp;T=1247096&amp;P=12#post1249995">said</a>:</p>
<blockquote>
<div>I agree with most people saying that dupe hack are really a plague but runewords like Last Wish look as they are made to tell people &#8220;Try a dupe mod, it&#8217;s the only way to make this&#8230;&#8221;</div>
</blockquote>
<h6>Poofing</h6>
<p>Up to this point, we&#8217;ve seen Blizzard create a ladder with a clean economy that is ripe for huge profits from cheating (both in game and out), create items such as the Annihilus that a legitimate player can never acquire, and create Rune Words that would take a legitimate player many years of playing to collect the necessary runes. With their &#8220;aggressive stance against cheating&#8221;, you would think they would have done something for the legitimate players. Well, it turns out you are right. They have kept the exact same dupe-scanning system which has been in place since 1.06! You know, the one that doesn&#8217;t hurt dupers and penalizes legitimate players? The &#8220;technical&#8221; term for an illegal item being deleted is &#8220;poofing&#8221;, and it is the cornerstone of Blizzard&#8217;s ability to allow duping while keeping the economy from turning into <a href="http://www.joelscoins.com/exhibger2.htm">World War One Deutschmarks</a>.</p>
<p>While it is true the system will detect and delete dupes, there is a minuscule catch; a tiny, insignificant, hardly worth mentioning, <strong>extremely well known</strong> method of bypassing this check. In fact, there are multiple well known methods (<a href="http://www.hiddenstuff.com/members/diablo2-permitems.htm">Perming Guide from 1.09 anyone</a>?), although there may be many more that are kept secret by the duping community.</p>
<p>The easiest method is simple and extremely reliable. If you socket a duped rune in an item, the duped rune will be safe from poofing. Therefore you can take your duped Vex and Zod runes (along with the easy to find Hel, El, Eld, and Eth runes), socket them properly in a 6-socket Poleaxe, and you will have a &#8220;Breath of The Dying&#8221; Poleaxe which is perfectly safe from poofing. You can even socket a single Vex rune in your weapon for safe keeping until you acquire a Zod to complete the Rune Word.</p>
<p>A more popular method is the one that allows you to protect <strong>any dupe</strong> from poofing, whether it is a rune, item, jewel, charm, whatever. Our good friends at enzod2.com have instructions on how to &#8220;<a rel="nofollow" href="http://www.enzod2.com/index.php?o=tpm">temp perm</a>&#8221; items you buy with real money from their counterfeit store. This method is known by nearly all players and is widely used to keep vast stores of dupes safe from deletion while trading.</p>
<blockquote>
<div>
<p>The Temp Perm Method will only allow you to keep your dupes for THAT game ONLY. You MUST repeat the Temp Perm Method for EVERY game in order for your dupes to never disappear. Also, Runewords that are made out of duplicate runes will not have to be permed.</p>
<p>The Temp Perm Method:</p>
<ol>
<li>Open a trade window with another player.</li>
<li>Put your duped/potential duped item you want to perm in the trade window. (This step is actually optional).</li>
<li>Save + Exit immediately after closing your trade window.</li>
<li>The above will work every time, just do not forget to do it every game to be safe.</li>
</ol>
<p>Remember to make sure you have the trade window up before cancelling the trade window and then exiting game.</p></div>
</blockquote>
<p>Note that if you forget to do this, you will risk losing any duped items. If you do not realize that the item you just traded for is a dupe, if you accidentally forget to &#8220;temp perm&#8221; your dupes, or if you do not think that dupers would traffic in such a lowly item, you risk losing your items. A search for &#8220;poof&#8221; on Blizzard&#8217;s official USWest <strong>Ladder</strong> forum yields <a href="http://www.battle.net/forums/thread-search.aspx?ForumName=d2-uswest-ladder-trading&amp;SearchType=2&amp;SearchText=poof&amp;x=18&amp;y=9">hundreds of results</a>.</p>
<blockquote>
<div>
<ul>
<li>20/19/7 &#8220;legit&#8221; anni i bought from you poofed.. sigh..</li>
<li> Don&#8217;t feel bad, last night my 399/39/15 Grief, 27× 20 life SC&#8217;s, and my 140/15/40 Dungo poofed. Paid 64 HR&#8217;s for all of it, lol.</li>
<li>the coa that poofed on me was a 2/26/30. hopin for stats around there LEGIT tho plz.</li>
<li> maras must be legit, I have had so many items poof lately its not funny. Post Offers.</li>
<li> My 15 sup 1368 Archon Enigma poofed on me</li>
<li> yep mine just poofed. leave stats and price. this fvcking sucks.</li>
<li> perfect coa, lucky to get 30 for it lately, too many dupes and too many poofing</li>
<li> um both of your sojs poofed on me &#8230;.</li>
<li> ya its duped; just not mass duped..but still will poof w/o perm</li>
</ul>
</div>
</blockquote>
<p>Granted, none of these posts were by legitimate players, but that is to be expected. The economy is so flooded by dupes now that you cannot buy or trade any item of higher than medium value without a high probability of it being duped. Nor can you sell a valuable item without risking it finding its way to a duper, who then carefully floods the market with copies. Trading forums will often have a list of banned dupes, such as d2jsp.org&#8217;s <a href="http://forums.d2jsp.org/index.php?showtopic=4164550&amp;f=169">list of banned items</a> on USEast <strong>Ladder</strong>. These lists are of course always incomplete, always growing, and have no effect on the propagation or trading of dupes. <strong>You cannot accept Diablo II items from other players and stay legitimate at the same time</strong>. Thanks for looking out for the little guy, Blizzard!</p>
<h4>Where To Now?</h4>
<p>Little has changed since the 1.10 patch. Blizzard did have a <a href="http://www.battle.net/news/0508.shtml">36,000 account</a> mass ban on Aug 11, 2005. The mass ban happened to coincide with the latest ladder reset and the release of <a href="http://www.blizzard.com/support/?id=mdt01827p">Version 1.11</a>, which added a few new items, a new quest, a new anti-cheat tool borrowed from World of Warcraft called <a href="http://en.wikipedia.org/wiki/The_Warden_(software)">Warden</a>, and absolutely nothing against duping. Another mass ban of <a href="http://www.team5150.com/~andrew/blog/images/d2-ban-35000.png">35,000 accounts</a> occurred on July 24, 2006; other than a possible revenue boost to Blizzard, the ban had little effect.</p>
<p>To Blizzard&#8217;s credit, Warden has been largely successful in chasing out most hacks. The downside is that it has taken nearly a year for hack authors to back down, mostly due to the glacial update frequency of Warden. Even worse, mmBot, one of the truly damaging programs, has remained completely undetected. Proponents of mmBot claim it is because mmBot is driven by <a href="http://www.autoitscript.com/autoit3/">AutoIt</a>, a script-driven engine which does not hack Diablo in any way and works purely off of graphical analysis, but this is a weak reason to ignore it.</p>
<p>To Blizzard&#8217;s major discredit, duping was not, nor has ever been, fixed. Bugged <strong>Non-Ladder</strong> items from previous patches are <a href="http://www.battle.net/forums/thread-search.aspx?SearchType=2&amp;searchtext=08+valk&amp;x=0&amp;y=0&amp;forumname=d2-uswest-ladder-trading">rampant in the ladder economy</a>, almost any valuable item is a dupe (whether the person trying to trade claims it is legit or not), legitimate players are at risk every time they acquire items they do not find themselves, and nobody cares because business is scamming, and scamming is good. The black market <strong>is</strong> the market and is dictated by what the dupers manage to acquire and sell to the public, nothing else.</p>
<p>You could argue that the game is successful because of the hacks and dupes and not in spite of them. You might also argue that the joy of finding items with bots, the glee of killing other players with third party hacks, and the pride of selling counterfeit items to strangers is a game in itself, and worthy of supporting. You could even argue that Blizzard is merely giving the people what they want and should be lauded for their keen business sense. Whatever you argue, one thing is certain: With each update to Diablo II and each failure to either fix duping or admit defeat and end the game, Blizzard is saying loud and proud: <strong>We Love Diablo II Cheats!</strong></p>
<h4>Postscript</h4>
<p>I will admit that the community in no way helped inform my feelings on the game. Where exactly is the fun in participating in a community where you must interact with hordes of immature con-artists who have no integrity and whose only goal in life seems to be sucking up to the stronger while trampling the weaker? Who wants to talk with people who think insults, bragging, lying, scamming, and illiteracy are virtues? The have-nots are little better, constantly begging and then insulting you if you do not do enough for them, as I have seen many times over while attempting to give items away. The few honest players doing their best to make the game more enjoyable for everyone only hasten the rise of the gutter-trash to the top. Were there no corruption in the game and were the players to remain the same, I would still not interact with them, but I believe the increased enjoyment from a clean economy would make up for their presence.</p>
<p>In any event, there has been a bit of <a href="http://www.battle.net/forums/thread.aspx?fn=d2-general&amp;t=1259277&amp;tmp=1#post1259277">posturing</a> over a possible new patch and ladder reset, and the official suggestions report contains &#8220;Run regular Ruststorms weekly to help remove Duped Items&#8221; and not &#8220;Remove duping entirely&#8221;. If the trends of the past 6 years continue, it seems everyone can look forward to another glorious ladder season which will slowly spiral downward as the dupers once again squeeze every penny out of the arbitrary economy they have ruled all these years.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://floodyberry.wordpress.com/2006/10/06/why-blizzard-loves-diablo-ii-cheats/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		
		<media:content url="https://0.gravatar.com/avatar/f0763395c4ecb8b1ff85e30f6d78323b7a7c85a2bfd2f6d6e218f947c55dc734?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">floodyberry</media:title>
		</media:content>
	</item>
	</channel>
</rss>
