<?xml version='1.0' encoding='UTF-8'?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:blogger="http://schemas.google.com/blogger/2008" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr="http://purl.org/syndication/thread/1.0" version="2.0"><channel><atom:id>tag:blogger.com,1999:blog-1401515658801150010</atom:id><lastBuildDate>Sat, 11 Apr 2026 22:20:46 +0000</lastBuildDate><category>python</category><category>big data</category><category>concurrency</category><category>functional programming</category><category>java</category><category>scala</category><category>machine learning</category><category>coursera</category><category>javascript</category><category>parallelism</category><category>algorithms</category><category>devops</category><category>nodejs</category><category>books</category><category>cloud</category><category>linear algebra</category><category>non-blocking I/O</category><category>nosql</category><category>optimization</category><category>rust</category><category>MVC</category><category>bindings</category><category>blockchain</category><category>clojure</category><category>graph theory</category><category>nlp</category><category>simulation</category><category>statistics</category><category>time series</category><category>APIs</category><category>C#</category><category>JEE 7</category><category>feature engineering</category><category>go</category><category>math</category><category>mongodb</category><category>probability</category><category>quantum</category><category>rdms</category><category>security</category><category>testing</category><category>visualization</category><category>web socket</category><title>programming opiethehokie</title><description></description><link>http://www.programmingopiethehokie.com/</link><managingEditor>noreply@blogger.com 
(opiethehokie)</managingEditor><generator>Blogger</generator><openSearch:totalResults>53</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-6526108765702851622</guid><pubDate>Tue, 01 Jul 2025 00:05:00 +0000</pubDate><atom:updated>2025-07-04T11:34:01.874-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">blockchain</category><category domain="http://www.blogger.com/atom/ns#">machine learning</category><category domain="http://www.blogger.com/atom/ns#">rust</category><title>Zero-Knowledge Proofs: Verifying Computation and Preserving Privacy</title><description>&lt;h3 style=&quot;text-align: left;&quot;&gt;Expanding Verifiability Beyond Merkle Trees in the Age of AI&lt;/h3&gt;&lt;p&gt;From my previous post, you&#39;re already familiar with &lt;a href=&quot;https://www.programmingopiethehokie.com/2025/02/merkle-trees.html&quot; target=&quot;_blank&quot;&gt;Merkle Trees&lt;/a&gt; as a powerful data structure for efficient and secure validation of contents. You know they achieve this by hashing data and then hashing those hashes up a tree, allowing for Merkle proofs that can verify data inclusion or consistency without revealing the entire dataset. This capability is becoming even more crucial as AI-generated content makes verifiable content more imperative.&amp;nbsp;&lt;/p&gt;&lt;p&gt;But what if you need to prove something more complex than just data inclusion? What if you need to prove a computation was performed correctly, or that you know a secret, without revealing that secret? 
This is where Zero-Knowledge Proofs (ZKPs) come into play, offering new dimensions of verifiability and privacy.&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;What are Zero-Knowledge Proofs?&lt;/h3&gt;&lt;p&gt;A Zero-Knowledge Proof is a cryptographic protocol where a prover can convince a verifier that a statement is true, without revealing any information beyond the truth of the statement itself. Think of it like proving you&#39;re over 18 without showing your ID or revealing your name and address. ZKPs bring two main &quot;primitives&quot; or building blocks:&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;&lt;b&gt;Computational Integrity (Succinctness)&lt;/b&gt;: They allow you to create proofs of computations that are significantly easier and faster to verify than to perform the original computation. This means the proof itself remains small, regardless of how complex the computation being proven is. Just as a Merkle proof is small compared to the original data, a ZKP is small compared to the computation it verifies.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Zero-Knowledge (Privacy)&lt;/b&gt;: They provide the option to hide parts of the computation (like sensitive inputs or even parts of the model) while still proving its correctness.&lt;/li&gt;&lt;/ul&gt;&lt;p style=&quot;text-align: left;&quot;&gt;While generating ZK proofs can be very computationally intensive, advancements in cryptography, hardware, and distributed systems are making them feasible for increasingly complex computations. This expansion of capabilities opens up a vast &quot;design space for new applications&quot;.&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;Programming ZKPs: A Shift in Mindset&lt;/h3&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Unlike traditional programming, which focuses on &lt;i&gt;how&lt;/i&gt; to compute, programming ZKPs (often called circuits) focuses on defining a set of constraints. 
These constraints are mathematical rules that the computation must satisfy. For example, you might constrain that two secret numbers multiplied together equal a public number, without ever revealing the secret numbers.&lt;/p&gt;&lt;div&gt;The typical workflow for building a ZKP involves:&lt;/div&gt;&lt;div&gt;&lt;ol style=&quot;text-align: left;&quot;&gt;&lt;li&gt;&lt;b&gt;Writing the circuit&lt;/b&gt;: Defining the constraints of your computation.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Building the circuit&lt;/b&gt;: Compiling it into a binary form and &lt;a href=&quot;https://www.programmingopiethehokie.com/2020/12/rust-to-webassembly-and-fibonacci.html&quot; target=&quot;_blank&quot;&gt;WebAssembly&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Trusted Setup&lt;/b&gt;: A crucial pre-processing step that generates a proving key (for the prover) and a verification key (for the verifier).&amp;nbsp;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Generating the proof&lt;/b&gt;: Using your private inputs (the &quot;witness&quot;), the compiled circuit, and the proving key.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Verifying the proof&lt;/b&gt;: Using the verification key, the public output, and the generated proof.&lt;/li&gt;&lt;/ol&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Concepts like hash functions are fundamental in ZKPs, just as they are in Merkle Trees. However, ZKPs often use &quot;ZK-Friendly hash functions&quot; like &lt;a href=&quot;https://www.poseidon-hash.info/&quot; target=&quot;_blank&quot;&gt;Poseidon&lt;/a&gt;, which are optimized for use within ZKP circuits, offering significant performance gains compared to traditional hashes like SHA-256 due to their arithmetic-based implementation.&amp;nbsp;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Commitments, a cryptographic primitive allowing you to &quot;commit&quot; to a secret value without revealing it, are also crucial, often built using these hash functions. 
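A hash-based commitment can be sketched in a few lines of plain Python (an illustration only — real ZKP circuits would use a ZK-friendly hash like Poseidon over field elements rather than SHA-256 over bytes):

```python
import hashlib
import secrets

def commit(value: bytes) -> tuple[bytes, bytes]:
    # Commit to a secret value by hashing it together with a random nonce.
    # The nonce (a "blinding factor") prevents brute-forcing small values.
    nonce = secrets.token_bytes(32)
    digest = hashlib.sha256(nonce + value).digest()
    return digest, nonce

def verify(digest: bytes, value: bytes, nonce: bytes) -> bool:
    # Opening the commitment reveals the value and nonce; anyone can
    # recompute the hash and check that it matches the earlier digest.
    return hashlib.sha256(nonce + value).digest() == digest

digest, nonce = commit(b"secret ballot: yes")
assert verify(digest, b"secret ballot: yes", nonce)
assert not verify(digest, b"secret ballot: no", nonce)
```

The hiding property comes from the random nonce; the binding property comes from the collision resistance of the hash.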
These are key building blocks for applications like digital signatures and more advanced concepts like group signatures, where you can prove you are part of a group without revealing your specific identity.&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;Programming ZKPs: An Example&lt;/h3&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Let&#39;s walk through a basic example of proving that we know two numbers whose product is 36 without revealing what those numbers are.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Write the circuit using &lt;a href=&quot;https://github.com/iden3/circom&quot; target=&quot;_blank&quot;&gt;Circom&lt;/a&gt;.&amp;nbsp;&lt;i&gt;c &amp;lt;== a * b;&lt;/i&gt; is the constraint: two numbers multiplied together equal a third number.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/c3d1615a505608ad96f67d8cb9b1bd09.js&quot;&gt;&lt;/script&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Circom compiles it into a Wasm file, which we&#39;ll use to generate a witness specifying our private inputs when creating the proof, and a Rank 1 Constraint System binary file mathematically defining our single constraint.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Then we perform a trusted setup. The generated Common Reference String (CRS) consists of a proving key and a verification key. These keys can then be used every time we want to generate and verify proofs, respectively. 
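Before moving on, the Rank 1 Constraint System idea is worth making concrete. As a rough plain-Python illustration (real systems work over a large prime finite field, which this sketch ignores), the constraint c = a * b becomes one row of selector vectors A, B, C applied to a witness vector w = [1, a, b, c], with the rule (A·w)(B·w) = (C·w):

```python
def dot(row, w):
    # Inner product of a selector row with the witness vector.
    return sum(r * x for r, x in zip(row, w))

# Witness vector: [1, a, b, c] with secret a, b and public output c.
w = [1, 9, 4, 36]

# One R1CS row encoding the single constraint c == a * b.
A = [0, 1, 0, 0]  # selects a
B = [0, 0, 1, 0]  # selects b
C = [0, 0, 0, 1]  # selects c

assert dot(A, w) * dot(B, w) == dot(C, w)  # constraint satisfied

# A tampered public output (like editing public.json) fails the check.
w_bad = [1, 9, 4, 35]
assert dot(A, w_bad) * dot(B, w_bad) != dot(C, w_bad)
```

The proof system establishes that such an identity holds without revealing a and b, using the keys produced by the setup.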
They can be shared publicly and I provide mine here as part of the example:&lt;/p&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;&lt;a href=&quot;https://drive.google.com/file/d/1XiS8L-JatnaFO_9xyWMhstl0IWe9nLsi/view?usp=sharing&quot;&gt;example1_verification_key.json&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://drive.google.com/file/d/1-VTmq8FmxsNcoc7Mx_Z2odk4nlG8hQKk/view?usp=sharing&quot;&gt;example1_0001.zkey&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Finally, we generate the proof using&amp;nbsp;&lt;a href=&quot;https://github.com/iden3/snarkjs&quot; target=&quot;_blank&quot;&gt;snarkjs&lt;/a&gt;&amp;nbsp;with the Wasm file, proving key and private input that might be something like &lt;i&gt;9&lt;/i&gt; and &lt;i&gt;4&lt;/i&gt;.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/6d165364984f0239661571b783a72cb9.js&quot;&gt;&lt;/script&gt;&lt;p style=&quot;text-align: left;&quot;&gt;We get proof and public output JSON files.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/0295a999fcfd89bc6cb73adcee4f684d.js&quot;&gt;&lt;/script&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/803d15c5b99afda755c8b1b16e1dea61.js&quot;&gt;&lt;/script&gt;&lt;p style=&quot;text-align: left;&quot;&gt;We&#39;ve proven that&amp;nbsp;we know two secret values, a and b, whose product is 36. You can verify the proof (assuming you trust my verification key) with snarkjs.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;span style=&quot;color: #666666;&quot;&gt;$ snarkjs groth16 verify example1_verification_key.json public.json proof.json&lt;br /&gt;[INFO]&amp;nbsp; snarkJS: OK!&lt;/span&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;If you change public.json to contain a different number the proof will no longer be valid. 
I no longer have a proof that I know the factors of this new number.&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;ZKPs, Blockchains, and Machine Learning (ZKML)&lt;/h3&gt;&lt;p style=&quot;text-align: left;&quot;&gt;The convergence of ZKPs, blockchains (Web3), and machine learning is a rapidly advancing area with significant potential.&lt;/p&gt;&lt;/div&gt;&lt;div&gt;Blockchain use cases include:&lt;/div&gt;&lt;div&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;&lt;b&gt;Scaling Blockchains&lt;/b&gt;: Public blockchains have limited computational power. ZKPs enable computations to be executed off-chain, with only a small ZK proof verified on-chain. This scales blockchains without sacrificing decentralization or security. Examples include ZK rollups like &lt;a href=&quot;https://docs.polygon.technology/zkEVM/&quot; target=&quot;_blank&quot;&gt;Polygon zkEVM&lt;/a&gt;&amp;nbsp;and &lt;a href=&quot;https://www.zksync.io/&quot; target=&quot;_blank&quot;&gt;zkSync&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Privacy-Preserving Applications&lt;/b&gt;: The zero-knowledge property is ideal for creating applications that protect users&#39; privacy and personal data when making cryptographic attestations. 
&lt;a href=&quot;https://aztec.network/&quot; target=&quot;_blank&quot;&gt;Aztec Network&lt;/a&gt;, for instance, uses a ZK rollup for Ethereum where users&#39; balances and transactions are completely hidden.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Identity Primitives and Data Provenance&lt;/b&gt;: Projects like &lt;a href=&quot;https://world.org/world-id&quot; target=&quot;_blank&quot;&gt;WorldID&lt;/a&gt; use ZKPs for privacy-preserving proof-of-personhood protocols, allowing a person to prove they are a unique human without revealing their identity.&lt;/li&gt;&lt;/ul&gt;&lt;p style=&quot;text-align: left;&quot;&gt;ZKML is about applying ZK proofs to machine learning models, specifically focusing on the inference step.&amp;nbsp;The core motivations for ZKML include:&lt;/p&gt;&lt;div&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;&lt;b&gt;Verifying AI-Generated Content&lt;/b&gt;: With AI content becoming indistinguishable from human-created content, ZKPs can help determine that a particular piece of content was produced by applying a specific model to a given input.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Privacy-Preserving Inference&lt;/b&gt;: ZKPs allow you to apply an ML model to sensitive data, where a user can get the result of the model&#39;s inference without revealing their input to any third party.&lt;/li&gt;&lt;/ul&gt;&lt;p style=&quot;text-align: left;&quot;&gt;While proving something as large as current LLMs with ZKPs is not currently feasible, there&#39;s significant progress on creating proofs for smaller models. 
Teams are actively working on improving ZK technology, including specialized hardware and proof system architectures, to allow proving bigger models on less powerful machines in less time.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;Summary&lt;/h3&gt;&lt;p style=&quot;text-align: left;&quot;&gt;While Merkle Trees excel at verifying data inclusion and consistency, ZKPs extend this idea to verifying computations and knowledge with the added benefit of privacy. This makes them incredibly powerful for building the next generation of scalable and private applications on blockchains, especially as AI-generated content and privacy concerns continue to grow. The future of verifiable content, whether data or computation, is increasingly intertwined with these advanced cryptographic proofs.&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;Update&lt;/h3&gt;&lt;p style=&quot;text-align: left;&quot;&gt;A few days after I published this I saw&amp;nbsp;&lt;a href=&quot;https://blog.google/technology/safety-security/opening-up-zero-knowledge-proof-technology-to-promote-privacy-in-age-assurance/&quot; target=&quot;_blank&quot;&gt;Opening up ‘Zero-Knowledge Proof’ technology to promote privacy in age assurance&lt;/a&gt; from Google showing that some well-known players are active in this space as well.&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;&lt;span&gt;Sources&lt;/span&gt;&lt;/h3&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;a href=&quot;https://zkintro.com/articles/programming-zkps-from-zero-to-hero&quot;&gt;https://zkintro.com/articles/programming-zkps-from-zero-to-hero&lt;/a&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;a href=&quot;https://world.org/blog/engineering/intro-to-zkml&quot;&gt;https://world.org/blog/engineering/intro-to-zkml&lt;/a&gt;&lt;/p&gt;</description><link>http://www.programmingopiethehokie.com/2025/06/zero-knowledge-proofs-verifying.html</link><author>noreply@blogger.com 
(opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-74624703257000525</guid><pubDate>Mon, 17 Feb 2025 16:44:00 +0000</pubDate><atom:updated>2026-04-11T13:22:16.490-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">big data</category><category domain="http://www.blogger.com/atom/ns#">blockchain</category><category domain="http://www.blogger.com/atom/ns#">python</category><title>Merkle Trees</title><description>&lt;div&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Merkle_tree&quot; target=&quot;_blank&quot;&gt;Merkle Trees&lt;/a&gt;&amp;nbsp;are a data structure that allows for efficient and secure validation of contents. They are typically implemented as binary trees. Given N pieces of data, the tree has about 2N nodes (2N - 1 when N is a power of two) and log(N) height. Each leaf node is the hash of a piece of data and every non-leaf node is a hash of its children&#39;s hashes. With only a hash stored at each node, a Merkle tree has a small, predictable size compared to the underlying data, even for large N.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Hash_Tree.svg&quot; title=&quot;Azaghal, CC0, via Wikimedia Commons&quot;&gt;&lt;img alt=&quot;Graphical representation of the Merkle Tree. Illustration by David Göthberg&quot; height=&quot;255&quot; src=&quot;https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/Hash_Tree.svg/960px-Hash_Tree.svg.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Examples of Merkle tree usage include detecting data inconsistencies between replicas in NoSQL databases, ensuring file integrity in distributed storage systems, and verifying blockchain transactions. 
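The core hashing scheme fits in a few lines of plain Python with hashlib (a toy sketch: it duplicates the last node on odd-sized levels and skips the leaf/interior domain separation that real implementations add to resist second-preimage attacks):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    # Hash each piece of data to form the leaf level.
    level = [sha256(leaf) for leaf in leaves]
    # Repeatedly hash pairs of children until one root hash remains.
    while len(level) > 1:
        if len(level) % 2 == 1:          # duplicate the last node when a
            level.append(level[-1])      # level has an odd number of hashes
        level = [sha256(left + right)
                 for left, right in zip(level[::2], level[1::2])]
    return level[0]

root = merkle_root([b"tx1", b"tx2", b"tx3", b"tx4"])
print(root.hex())  # one 32-byte root summarizing all four entries
```

Changing any single leaf changes the root, which is what makes the root a compact fingerprint of the whole dataset.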
We can easily create one with &lt;a href=&quot;https://pymerkle.readthedocs.io/en/latest/index.html&quot; target=&quot;_blank&quot;&gt;pymerkle&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/4f7787c3cec74d515e81cec044d2c5ae.js&quot;&gt;&lt;/script&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;└─9f42c047...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; ├──50fcd75a...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; │&amp;nbsp; &amp;nbsp;├──4c4b77fe...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; │&amp;nbsp; &amp;nbsp;│&amp;nbsp; &amp;nbsp;├──e8bcd97e...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; │&amp;nbsp; &amp;nbsp;│&amp;nbsp; &amp;nbsp;│&amp;nbsp; &amp;nbsp;├──2215e8ac...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; │&amp;nbsp; &amp;nbsp;│&amp;nbsp; &amp;nbsp;│&amp;nbsp; &amp;nbsp;└──fa61e3de...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; │&amp;nbsp; &amp;nbsp;│&amp;nbsp; &amp;nbsp;└──9c769ac2...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; │&amp;nbsp; &amp;nbsp;│&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;├──906c5d24...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; │&amp;nbsp; &amp;nbsp;│&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;└──11e1f558...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; │&amp;nbsp; 
&amp;nbsp;└──fed7af7d...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; │&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;├──2b15ae18...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; │&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;│&amp;nbsp; &amp;nbsp;├──53304f5e...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; │&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;│&amp;nbsp; &amp;nbsp;└──3bf9c81c...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; │&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;└──8007dd69...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; │&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;├──797427cf...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; │&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;└──195f58bc...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; └──555da077...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ├──85224a5c...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; └──5c889ef4...&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A &quot;Merkle proof&quot; is a list of intermediate hashes from the path between a leaf node representing the data you want to prove and the root of the tree. Generating the proof is like a modified DFS. 
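That traversal can be sketched in plain Python (a toy version — pymerkle's actual proof format, shown below, also carries metadata and traversal rules). At each level we record the sibling of the node on the path, then move up to the parent:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def inclusion_proof(leaves: list[bytes], index: int) -> list[bytes]:
    # Walk from the target leaf up to the root, recording the sibling
    # hash needed at each level to recompute the parent.
    level = [sha256(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])      # pad odd levels, as when building
        proof.append(level[index ^ 1])   # sibling: flip the lowest bit
        index //= 2                      # position within the parent level
        level = [sha256(left + right)
                 for left, right in zip(level[::2], level[1::2])]
    return proof

proof = inclusion_proof([b"tx1", b"tx2", b"tx3", b"tx4"], index=2)
print(len(proof))  # log2(4) = 2 sibling hashes, far smaller than the tree
```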
The beauty of this is that anyone can verify the data is included in the tree without the whole tree being revealed.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/6b9f5f4820ac8bb1340a76d5cc84a88b.js&quot;&gt;&lt;/script&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;{&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &quot;metadata&quot;: {&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &quot;algorithm&quot;: &quot;sha256&quot;,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &quot;security&quot;: true,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &quot;size&quot;: 10&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; },&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &quot;rule&quot;: [&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 0,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 0,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 1,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 0,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: 
courier;&quot;&gt;&amp;nbsp; &amp;nbsp; ],&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &quot;subset&quot;: [],&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &quot;path&quot;: [&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &quot;53304f5e3fd4bcd20b39abdef2fe118031cc5ae8217bcea008dea7e27869348a&quot;,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &quot;3bf9c81c231cae70b678d3f3038f9f4f6d6b9d7adcf9b378f25919ae53d17686&quot;,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &quot;8007dd69b92a67ea6410098635fa8ba53c44a5994c7e5d92b99e27f0711c626f&quot;,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &quot;4c4b77fe3fc6cfb92e4d3c90b5ade42f059a1f112a49827f07edbb7bd4540e7b&quot;,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &quot;555da077fcadba1f23e0f2bfac8793e6a3c79a0d605902df34ab43d3e0fb487c&quot;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; ]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Verifying the proof requires only calculating the root hash from the provided proof and the data being verified. If the calculated root matches the known root of the tree then the data is present in the tree. 
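That recomputation is a short fold over the sibling hashes. A plain-Python sketch (toy conventions: the leaf's index supplies the left/right ordering at each step, information pymerkle appears to carry in its &quot;rule&quot; list):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_inclusion(leaf: bytes, index: int,
                     proof: list[bytes], root: bytes) -> bool:
    # Recompute the root from the leaf and its sibling hashes; the
    # index's low bit says whether the current node is left or right.
    node = sha256(leaf)
    for sibling in proof:
        node = sha256(sibling + node) if index % 2 else sha256(node + sibling)
        index //= 2
    return node == root

# Tiny two-leaf tree built by hand: root = H(H(a) + H(b))
h_a, h_b = sha256(b"a"), sha256(b"b")
root = sha256(h_a + h_b)

assert verify_inclusion(b"a", 0, [h_b], root)      # sibling to the right
assert verify_inclusion(b"b", 1, [h_a], root)      # sibling to the left
assert not verify_inclusion(b"c", 0, [h_b], root)  # absent data fails
```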
We are using less space and less compute than if we were iterating a list to check if data is present.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/7ad6a43ff995cd09264c74a95e06b15d.js&quot;&gt;&lt;/script&gt;&lt;/div&gt;&lt;div&gt;That was an inclusion or audit proof. We can also do a consistency proof. In an append-only tree we can verify earlier versions of the tree against later versions to make sure no tampering has occurred. The later version must include everything in the earlier version, in the same order, and all new entries come after old entries.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/7ba4dfd0817082c006f53865afdae85a.js&quot;&gt;&lt;/script&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;{&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &quot;metadata&quot;: {&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &quot;algorithm&quot;: &quot;sha256&quot;,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &quot;security&quot;: true,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &quot;size&quot;: 10&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; },&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &quot;rule&quot;: [&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 1,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; 
&amp;nbsp; &amp;nbsp; 0,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 0,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 0,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; ],&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &quot;subset&quot;: [&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 0,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 1,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 0,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 0,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; ],&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &quot;path&quot;: [&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &quot;fa61e3dec3439589f4784c893bf321d0084f04c572c7af2b68e3f3360a35b486&quot;,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 
&quot;2215e8ac4e2b871c2a48189e79738c956c081e23ac2f2415bf77da199dfd920c&quot;,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &quot;9c769ac26f8d61ff40859e5201537845555136f0fd7ab604f7033180fbe76af9&quot;,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &quot;fed7af7d64bf0a73fcad018df1219928dbafa4d96b5d78f8a5e9be66ff0ada38&quot;,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &quot;555da077fcadba1f23e0f2bfac8793e6a3c79a0d605902df34ab43d3e0fb487c&quot;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; &amp;nbsp; ]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Given how powerful and pervasive Merkle trees are, I&#39;m surprised they aren&#39;t discussed more along with other common data structures. 
It seems that with AI-generated content making verifiable content more imperative, their usage will only increase going forward.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;</description><link>http://www.programmingopiethehokie.com/2025/02/merkle-trees.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-7937670990297527603</guid><pubDate>Tue, 16 Jan 2024 02:37:00 +0000</pubDate><atom:updated>2025-04-05T14:04:04.825-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">parallelism</category><category domain="http://www.blogger.com/atom/ns#">python</category><title>CUDA Kernels: Speeding up Matrix Multiplication</title><description>&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;The Python ecosystem benefits greatly from being able to use libraries written in C/C++ and Rust (see my &lt;a href=&quot;https://www.programmingopiethehokie.com/2023/07/rust-to-python-and-fibonacci-numbers.html&quot; target=&quot;_blank&quot;&gt;previous post&lt;/a&gt;) to increase performance. Increasingly, though, I&#39;ve been running code on GPUs instead. Libraries like PyTorch simplify this, but how do they do it? Previously I could only answer something like &quot;it&#39;s CUDA&quot; without knowing what that really means. In this post we&#39;ll dive deeper and see what it takes to create our own CUDA kernel. 
We&#39;ll (re-)implement matrix multiplication, run it on a GPU, and compare to numpy and PyTorch performance.&lt;br /&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAgXJqQN-Nr5Xs7HtnG8VUha4_PI7q-An4Tnrz3EskuFl_FXptHqbke3ctnZJ77HarswT23Hnsj1nSEv36MMDas1QaZd3QfrWYEA2nv3WxwUzp5LkRRtaowUjqmf-1K-BM7ZBiBTqgJLrvAUbnphw9vA5RWFPOM8IVk_b6kdNBXic3pxtbR48Dl3hV65LA/s1792/DALL%C2%B7E%202024-01-10%2018.47.10%20-%20Wide%20illustrations%20for%20a%20computer%20science%20blog,%20focusing%20on%20immense%20number%20crunching%20in%20CUDA%20kernel%20for%20matrix%20multiplication,%20with%20an%20&#39;Icy&#39;%20color%20the.png&quot; style=&quot;clear: left; float: left; margin-bottom: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;1024&quot; data-original-width=&quot;1792&quot; height=&quot;229&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAgXJqQN-Nr5Xs7HtnG8VUha4_PI7q-An4Tnrz3EskuFl_FXptHqbke3ctnZJ77HarswT23Hnsj1nSEv36MMDas1QaZd3QfrWYEA2nv3WxwUzp5LkRRtaowUjqmf-1K-BM7ZBiBTqgJLrvAUbnphw9vA5RWFPOM8IVk_b6kdNBXic3pxtbR48Dl3hV65LA/w630-h229/DALL%C2%B7E%202024-01-10%2018.47.10%20-%20Wide%20illustrations%20for%20a%20computer%20science%20blog,%20focusing%20on%20immense%20number%20crunching%20in%20CUDA%20kernel%20for%20matrix%20multiplication,%20with%20an%20&#39;Icy&#39;%20color%20the.png&quot; width=&quot;630&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;First, what is CUDA?&amp;nbsp;&lt;a href=&quot;https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html&quot; target=&quot;_blank&quot;&gt;CUDA&lt;/a&gt;&amp;nbsp;is a general 
purpose parallel computing platform and API. It gives direct access to the NVIDIA GPU&#39;s virtual instruction set and parallel compute elements. It works with C, C++, and Fortran, has Java and Python wrappers, and is supposed to be easier to use than earlier APIs like &lt;a href=&quot;https://developer.nvidia.com/opencl&quot; target=&quot;_blank&quot;&gt;OpenCL&lt;/a&gt;. CUDA allows us to execute kernels, which are functions compiled for the GPU and run separately from the main program. This is how the majority of deep learning today is executed.&lt;/p&gt;&lt;p&gt;Second, what makes GPUs so special?&amp;nbsp;GPUs are generally slower than CPUs at a single operation but excel at operations that can be parallelized. They can have thousands of cores and higher memory bandwidth. If we want to tackle problems that are more than embarrassingly parallel, like matrix multiplication, we need a &lt;a href=&quot;https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm#Parallel_and_distributed_algorithms&quot; target=&quot;_blank&quot;&gt;parallel algorithm&lt;/a&gt;&amp;nbsp;to make use of the GPU.&lt;/p&gt;&lt;p&gt;Third, how do we access it from Python? It turns out there are several Python wrappers to choose from. I chose to go with&amp;nbsp;&lt;a href=&quot;https://numba.readthedocs.io/en/stable/user/5minguide.html&quot; target=&quot;_blank&quot;&gt;Numba&lt;/a&gt;, a just-in-time (JIT) compiler that can target both CPUs and GPUs. It lets you write kernels directly using a subset of Python. It does not implement the complete CUDA API, but supports enough to tackle many problems. There are other (uninvestigated) options like CuPy and PyCUDA as well.&lt;/p&gt;&lt;p&gt;Lastly, before we get to the code, there is the topic of writing efficient kernels. While this is beyond the scope of my post, I can say that I&#39;ve at least learned it requires understanding several additional concepts beyond general concurrent programming. CUDA kernels are not for the faint of heart.
It&#39;s not something you can just pick up in a couple of hours or from following one tutorial. One of the things you&#39;ll first notice, as a very simple example, is the need to explicitly move data back and forth between CPUs and GPUs. And kernels can&#39;t return values, you can only pass inputs and outputs.&lt;/p&gt;&lt;p&gt;For a better intro, I recommend reading the four-part &lt;a href=&quot;https://towardsdatascience.com/cuda-by-numba-examples-1-4-e0d06651612f&quot; target=&quot;_blank&quot;&gt;CUDA by Numba Examples&lt;/a&gt; series (&lt;a href=&quot;https://towardsdatascience.com/cuda-by-numba-examples-215c0d285088&quot; target=&quot;_blank&quot;&gt;part 2&lt;/a&gt;, &lt;a href=&quot;https://towardsdatascience.com/cuda-by-numba-examples-7652412af1ee&quot; target=&quot;_blank&quot;&gt;part 3&lt;/a&gt;, &lt;a href=&quot;https://towardsdatascience.com/cuda-by-numba-examples-c583474124b0&quot; target=&quot;_blank&quot;&gt;part 4&lt;/a&gt;).&lt;/p&gt;&lt;p&gt;Our baseline for this experiment will be numpy&#39;s highly-optimized matmul function. It supports &lt;a href=&quot;https://www.programmingopiethehokie.com/2014/08/exploring-vectorization.html&quot; target=&quot;_blank&quot;&gt;vectorized array operations&lt;/a&gt; and some multi-threading by releasing the GIL per &lt;a href=&quot;https://scipy-cookbook.readthedocs.io/items/ParallelProgramming.html&quot; target=&quot;_blank&quot;&gt;Parallel Programming with numpy and scipy&lt;/a&gt;. The underlying &lt;a href=&quot;https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms&quot; target=&quot;_blank&quot;&gt;BLAS&lt;/a&gt; routines will be optimized for your hardware. This method of matrix multiplication has been tuned over the past several decades.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/225c2f77f6e5b95e37d2168a5b748bb0.js&quot;&gt;&lt;/script&gt;&lt;p&gt;Here we&#39;ll try our own kernel. 
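&lt;/p&gt;&lt;p&gt;To make the idea concrete: in a CUDA matmul kernel, each GPU thread computes one output element as an independent dot product. The sketch below is illustrative pure Python (not the gist&#39;s Numba code) showing that per-thread work on a CPU:&lt;/p&gt;

```python
# Illustration of the work a single CUDA thread would do in a naive
# matrix multiplication kernel: compute one (row, col) output element.
def matmul_element(A, B, row, col):
    # An independent dot product -- no output element depends on another,
    # which is why one GPU thread can be launched per output element.
    total = 0.0
    for k in range(len(B)):
        total += A[row][k] * B[k][col]
    return total

# On a CPU we loop over the grid; on a GPU these iterations run in parallel.
def matmul(A, B):
    return [[matmul_element(A, B, i, j) for j in range(len(B[0]))]
            for i in range(len(A))]
```

&lt;p&gt;A real Numba kernel adds the CUDA-specific pieces: copying inputs to device memory, deriving each thread&#39;s (row, col) from its grid position, and writing into a preallocated output array.&lt;/p&gt;&lt;p&gt;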
Our kernel moves the computation from the CPU to the GPU, but it likely lacks many optimizations that a battle-tested library has. Still, for medium-sized, two-dimensional matrices, we get some performance improvement.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/1ccf9c03a8f75e81a9a52f54c021e6a2.js&quot;&gt;&lt;/script&gt;&lt;p&gt;Now we&#39;ll use PyTorch&#39;s matmul function. It&#39;s highly optimized like numpy and has the GPU benefit like the kernel. It&#39;s amazingly fast and works with larger, higher-dimensional matrices.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/8adbb1bfcef7c377ef45d4f51ba47eef.js&quot;&gt;&lt;/script&gt;&lt;p&gt;NVIDIA has a profiling tool called &lt;a href=&quot;https://developer.nvidia.com/nsight-systems&quot; target=&quot;_blank&quot;&gt;Nsight Systems&lt;/a&gt; that we can use to see GPU utilization. GPUs are expensive, so we want them to be fully utilized. From the reports, I see that the PyTorch implementation used more threads, which is consistent with higher parallelism. It also seems to have a higher ratio of memory operations vs. kernel operations.
I&#39;m not sure I fully understand the significance of that ratio, but the kernel operations are sgemm, BLAS&#39;s single-precision general matrix multiply routine, so like numpy on the CPU, PyTorch is ultimately relying on a tuned BLAS-style implementation.&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnRl4wMC9TCnU19fqtezi6ttlxklw_uehEn5oZsb73laV2QLjt8q-Ce8YUVThyGsV1h2pRLaeOanNqk8FGT29AIFPHqihyphenhyphenVbjYgR88x9Kp9cuLcugplVnZNAfsgBNg0Ts1m6s_enEzpKsEjGZ4jKzLO3h3l4yGOPWW9FQDsL1GIniopUmIbTGRN8oHtXCM/s1275/Screenshot%202024-01-15%20210010.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;542&quot; data-original-width=&quot;1275&quot; height=&quot;170&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnRl4wMC9TCnU19fqtezi6ttlxklw_uehEn5oZsb73laV2QLjt8q-Ce8YUVThyGsV1h2pRLaeOanNqk8FGT29AIFPHqihyphenhyphenVbjYgR88x9Kp9cuLcugplVnZNAfsgBNg0Ts1m6s_enEzpKsEjGZ4jKzLO3h3l4yGOPWW9FQDsL1GIniopUmIbTGRN8oHtXCM/w400-h170/Screenshot%202024-01-15%20210010.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwpH2xoWLQGNayWLBJF8QwkMyUXk_9KOLBQLFoAqSbrd9GiZzKr_7ABle1Xi4qOd9Z7evJr24bA-U-g1SURDw7iM7Pl0DHsIsAI2VCx9CNBjR1tS3FQYtp4k2JtcGRK3YGODzjKBZyW3-lJVgq_8GBMH4VCEYGaAyXMX8x3y8_cEbGRj1iRzwpfJi4eCf2/s1439/Screenshot%202024-01-15%20210620.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;537&quot; data-original-width=&quot;1439&quot; height=&quot;149&quot;
src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwpH2xoWLQGNayWLBJF8QwkMyUXk_9KOLBQLFoAqSbrd9GiZzKr_7ABle1Xi4qOd9Z7evJr24bA-U-g1SURDw7iM7Pl0DHsIsAI2VCx9CNBjR1tS3FQYtp4k2JtcGRK3YGODzjKBZyW3-lJVgq_8GBMH4VCEYGaAyXMX8x3y8_cEbGRj1iRzwpfJi4eCf2/w400-h149/Screenshot%202024-01-15%20210620.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;p&gt;Creating a CUDA kernel has become accessible enough that I could do it in a couple of hours on a laptop, yet my implementation remains far from the best ones. Matrix multiplication is obviously common and great libraries exist. For less common operations,&amp;nbsp;even if there&#39;s a known parallel algorithm, I would hesitate to go the custom kernel route again. It&#39;s not a one-off thing you can casually try in order to improve performance; it requires a way of thinking and deep optimization knowledge that most software developers don&#39;t have. The underlying libraries used by PyTorch are, for example, optimizing use of memory caches, shared vs.
global memory accesses, thread utilization, and probably tuning kernel parameters for my specific GPU.&lt;/p&gt;&lt;p&gt;UPDATE:&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://thenewstack.io/nvidia-finally-adds-native-python-support-to-cuda/&quot; target=&quot;_blank&quot;&gt;NVIDIA Finally Adds Native Python Support to CUDA&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;</description><link>http://www.programmingopiethehokie.com/2024/01/cuda-kernels-speeding-up-matrix.html</link><author>noreply@blogger.com (opiethehokie)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAgXJqQN-Nr5Xs7HtnG8VUha4_PI7q-An4Tnrz3EskuFl_FXptHqbke3ctnZJ77HarswT23Hnsj1nSEv36MMDas1QaZd3QfrWYEA2nv3WxwUzp5LkRRtaowUjqmf-1K-BM7ZBiBTqgJLrvAUbnphw9vA5RWFPOM8IVk_b6kdNBXic3pxtbR48Dl3hV65LA/s72-w630-h229-c/DALL%C2%B7E%202024-01-10%2018.47.10%20-%20Wide%20illustrations%20for%20a%20computer%20science%20blog,%20focusing%20on%20immense%20number%20crunching%20in%20CUDA%20kernel%20for%20matrix%20multiplication,%20with%20an%20&#39;Icy&#39;%20color%20the.png" height="72" width="72"/></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-3298953494316628245</guid><pubDate>Wed, 12 Jul 2023 12:19:00 +0000</pubDate><atom:updated>2023-07-12T08:19:39.108-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">bindings</category><category domain="http://www.blogger.com/atom/ns#">parallelism</category><category domain="http://www.blogger.com/atom/ns#">python</category><category domain="http://www.blogger.com/atom/ns#">rust</category><title>Rust to Python and Fibonacci Numbers</title><description>&lt;p&gt;While Python continues to be the language of choice for ML projects, I&#39;m increasingly seeing mentions of &lt;a href=&quot;https://www.rust-lang.org/&quot; target=&quot;_blank&quot;&gt;Rust&lt;/a&gt; in packages I use. 
Hugging Face fast&amp;nbsp;&lt;a href=&quot;https://github.com/huggingface/tokenizers&quot; target=&quot;_blank&quot;&gt;tokenizers&lt;/a&gt;&amp;nbsp;are written in Rust to speed up training and tokenization. They say it &quot;Takes less than 20 seconds to tokenize a GB of text on a server&#39;s CPU.&quot; I recently used &lt;a href=&quot;https://www.pola.rs/&quot; target=&quot;_blank&quot;&gt;Polars&lt;/a&gt;, which is also written in Rust,&amp;nbsp;for a quick task where I wanted to run a SQL query on 20 GB of CSV files. It only took about 60 seconds on my laptop. Rust is a way of bypassing Python&#39;s GIL and writing code that is parallelizable. This is in contrast to packages like numpy that traditionally have been written in C/C++ for performance reasons.&amp;nbsp;&lt;/p&gt;&lt;p&gt;How does one turn Rust code into something you can call from Python? It turns out I&#39;ve already done something similar,&amp;nbsp;&lt;a href=&quot;https://www.programmingopiethehokie.com/2020/12/rust-to-webassembly-and-fibonacci.html&quot; target=&quot;_blank&quot;&gt;compiling Rust into WebAssembly and using it in a JavaScript project&lt;/a&gt;. For Python, the process is much the same. I followed&amp;nbsp;&lt;a href=&quot;http://saidvandeklundert.net/learn/2021-11-18-calling-rust-from-python-using-pyo3/&quot; target=&quot;_blank&quot;&gt;Calling Rust from Python using PyO3&lt;/a&gt; and re-created the Fibonacci numbers experiment (apparently I&#39;m not the only one with this idea) from my previous post. It&#39;s a toy example, just intended to show the ease with which Rust can be leveraged. It ignores more complex data types and any actual multi-threaded code.
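&lt;/p&gt;&lt;p&gt;For reference, the pure-Python side of this kind of experiment is essentially the naive recursive definition (a sketch; the gist may differ in details):&lt;/p&gt;

```python
# Naive recursive Fibonacci -- deliberately exponential-time, which is
# what makes it a useful stress test when comparing Python to Rust.
def fib(n):
    if n in (0, 1):
        return n
    return fib(n - 1) + fib(n - 2)
```

&lt;p&gt;Timing fib(35) on both sides, e.g. with time.perf_counter, is what produces the numbers quoted later in the post.&lt;/p&gt;&lt;p&gt;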
Perhaps that will be the topic of a future post.&lt;/p&gt;&lt;p&gt;The function in Rust looks like this:&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/e34557e8eaf5dc3bb6a8057e656bdcc2.js&quot;&gt;&lt;/script&gt;&lt;p&gt;And the Python code that calls it and times it:&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/90515356433e580aa4d12b70d9470532.js&quot;&gt;&lt;/script&gt;&lt;p&gt;Using the same n=35 as in my JavaScript experiment, I&#39;m seeing about .05 seconds for Rust vs. 7.31 seconds for pure Python. The likelihood of wanting/needing this in a Python project seems greater than with JavaScript. My guess is that the trend continues.&lt;/p&gt;</description><link>http://www.programmingopiethehokie.com/2023/07/rust-to-python-and-fibonacci-numbers.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-7189641275265986791</guid><pubDate>Tue, 31 Jan 2023 03:17:00 +0000</pubDate><atom:updated>2024-01-15T17:00:41.803-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">devops</category><category domain="http://www.blogger.com/atom/ns#">python</category><category domain="http://www.blogger.com/atom/ns#">simulation</category><title>Queueing Simulations</title><description>&lt;p&gt;Waiting in lines is something we&#39;ve all experienced. But how long will the wait for a table really be? How many checkouts should the grocery store have open?&amp;nbsp;&lt;a href=&quot;https://en.wikipedia.org/wiki/Queueing_theory&quot; target=&quot;_blank&quot;&gt;Queueing theory&lt;/a&gt; gives us the tools to be able to answer questions like these and more. It&#39;s also relevant to understanding software under heavy load. 
I&#39;ve seen it come up recently in The Amazon Builder&#39;s Library &lt;a href=&quot;https://aws.amazon.com/builders-library/avoiding-insurmountable-queue-backlogs&quot; target=&quot;_blank&quot;&gt;Avoiding Insurmountable Queue Backlogs&lt;/a&gt; and Site Reliability Engineering&#39;s &lt;a href=&quot;https://sre.google/sre-book/addressing-cascading-failures/&quot; target=&quot;_blank&quot;&gt;Addressing Cascading Failures&lt;/a&gt; chapter.&amp;nbsp;&lt;/p&gt;&lt;p&gt;Since you are reading this blog it is assumed that you are already familiar with queues and the idea of using them in software systems. To summarize, queues can improve availability and throughput at the expense of latency and resource consumption. Here we will see what the field of queueing theory can teach us about how and when to best use them.&amp;nbsp;&lt;/p&gt;&lt;p&gt;The Builder&#39;s Library article warns us of the bi-modal behavior of queue-based systems: the behavior can be very different based on whether there is a backlog or not. Recovery time after an outage can be dramatically increased and time can be wasted doing work that is no longer useful. We then get the first hints of how to manage this. Using more than one queue helps shape traffic. LIFO-ish behavior can be more desirable than FIFO. The SRE book discusses throttling via small queue sizes so requests are rejected when the incoming request rate is too high to be sustained. Traffic patterns, queue sizes, and the number of threads removing work from the queue are all intertwined. Can we better quantify these recommendations though?&lt;/p&gt;&lt;p&gt;I first learned of queueing theory in &lt;a href=&quot;https://github.com/VividCortex/ebooks/blob/master/queueing-theory.pdf&quot; target=&quot;_blank&quot;&gt;The Essential Guide to Queueing Theory&lt;/a&gt; e-book from VividCortex and wanted to revisit it in the context of the SRE reading. 
Right from the beginning we are warned that while queues are linear data structures, queueing systems behave non-linearly. Queueing theory is probabilistic. We won&#39;t understand the behavior exactly, but will know about wait times on average or a distribution of wait times.&lt;/p&gt;&lt;p&gt;The first thing they suggest we must understand is that queueing happens even when there is enough capacity to do the work. This is because work arrives to the queue in irregular sizes and at irregular intervals. Queueing gets worse at high utilization, when there is high variability, and when fewer workers are servicing the queues.&lt;/p&gt;&lt;p&gt;Any system can be decomposed into networks of queues and workers. These are the parameters and metrics commonly used to understand these systems:&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;arrival rate to the queue (&lt;i&gt;λ&lt;/i&gt; or &lt;i&gt;A&lt;/i&gt;) - in a stable system this is the same as throughput&lt;/li&gt;&lt;li&gt;average wait time or residence time in the queue (&lt;i&gt;W&lt;/i&gt; or&amp;nbsp;&lt;i&gt;W&lt;/i&gt;&lt;i&gt;&lt;sub&gt;q&lt;/sub&gt;&lt;/i&gt;)&lt;/li&gt;&lt;li&gt;average post-queue service time (&lt;i&gt;S&lt;/i&gt;)&lt;/li&gt;&lt;li&gt;service rate (&lt;i&gt;μ&lt;/i&gt;) - inverse of service time&lt;/li&gt;&lt;li&gt;latency or residence time (&lt;i&gt;R&lt;/i&gt;) - sum of wait time and service time&lt;/li&gt;&lt;li&gt;worker utilization (&lt;i&gt;ρ&lt;/i&gt; or &lt;i&gt;U&lt;/i&gt;) - 0 to 1&lt;/li&gt;&lt;li&gt;average number of requests waiting or in service concurrently (&lt;i&gt;L &lt;/i&gt;or&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;L&lt;sub&gt;q&lt;/sub&gt;&lt;/i&gt;)&lt;/li&gt;&lt;li&gt;number of workers (&lt;i&gt;M&lt;/i&gt;)&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Little%27s_law&quot; target=&quot;_blank&quot;&gt;Little&#39;s Law&lt;/a&gt; states that concurrency is arrival rate times residence 
time:&amp;nbsp;&lt;i&gt;L = λR&lt;/i&gt;&amp;nbsp;and&amp;nbsp;&lt;i&gt;L&lt;sub&gt;q =&amp;nbsp;&lt;/sub&gt;&lt;/i&gt;&lt;i&gt;λ&lt;/i&gt;&lt;i&gt;W&lt;/i&gt;&lt;i&gt;&lt;sub&gt;q&lt;/sub&gt;&lt;/i&gt;&lt;/p&gt;&lt;p&gt;The Utilization Law states that utilization is throughput times service time:&amp;nbsp;&lt;i&gt;ρ = λS&lt;/i&gt; or&amp;nbsp;&lt;i&gt;ρ = λ/μ&lt;/i&gt; or&amp;nbsp;with &lt;i&gt;M &lt;/i&gt;workers&amp;nbsp;&lt;i&gt;ρ = λS/M&lt;/i&gt;&lt;/p&gt;&lt;p&gt;Kendall&#39;s Notation is a shorthand for describing queue systems. Normally we&#39;ll be working with M/M/m systems. The first M says events arrive randomly and independently (memoryless), meaning they are generated by a &lt;a href=&quot;https://mathworld.wolfram.com/PoissonProcess.html&quot; target=&quot;_blank&quot;&gt;Poisson process&lt;/a&gt;. The second M says the service times are &lt;a href=&quot;https://en.wikipedia.org/wiki/Exponential_distribution&quot; target=&quot;_blank&quot;&gt;exponentially distributed&lt;/a&gt;. The last m is the number of workers, &lt;i&gt;M&lt;/i&gt; from above. These are safe assumptions unless you know otherwise.&lt;/p&gt;&lt;p&gt;For an M/M/1 system&amp;nbsp;&lt;i&gt;R = S / (1 - ρ)&lt;/i&gt;, which is a fast-growing curve (the non-linearity mentioned above). Residence time is inversely proportional to idle capacity. By rearranging the equations above we can solve for other parameters as well. For systems beyond M/M/1 it gets more complicated. For M/M/m, for example, we need to bring in &lt;a href=&quot;https://en.wikipedia.org/wiki/Erlang_(unit)#Erlang_C_formula&quot; target=&quot;_blank&quot;&gt;Erlang&#39;s C-formula&lt;/a&gt;.&lt;/p&gt;&lt;div&gt;To do that, and to see these laws in action, it&#39;s time to write some code. There appear to be several Python packages for simulating queueing systems.
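&lt;/div&gt;&lt;div&gt;Before reaching for a simulator, the formulas above can be computed directly. This helper is a sketch of the analytic M/M/m math using Erlang&#39;s C-formula (the example numbers are assumptions for illustration, not the parameters used later):&lt;/div&gt;

```python
import math

def mmm_metrics(lam, S, m):
    """Analytic M/M/m metrics: utilization, mean queue wait, residence time.

    lam = arrival rate, S = mean service time (1/mu), m = number of workers.
    Assumes a stable system, i.e. lam * S is smaller than m.
    """
    a = lam * S            # offered load in Erlangs
    rho = a / m            # per-worker utilization (the Utilization Law)
    # Erlang C: probability an arriving request has to wait in the queue
    top = (a ** m) / (math.factorial(m) * (1.0 - rho))
    bottom = sum((a ** k) / math.factorial(k) for k in range(m)) + top
    erlang_c = top / bottom
    Wq = erlang_c * S / (m * (1.0 - rho))  # mean wait in queue
    R = Wq + S                             # residence = wait + service time
    return rho, Wq, R
```

&lt;div&gt;For m = 1 this collapses to the &lt;i&gt;R = S&lt;/i&gt; / (1 -&amp;nbsp;&lt;i&gt;ρ&lt;/i&gt;) curve, and Little&#39;s Law then gives the average number of waiting requests as&amp;nbsp;&lt;i&gt;λ&lt;/i&gt;&lt;i&gt;W&lt;/i&gt;&lt;i&gt;&lt;sub&gt;q&lt;/sub&gt;&lt;/i&gt;.&lt;/div&gt;&lt;div&gt;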
I will use &lt;a href=&quot;https://github.com/CiwPython/Ciw.git&quot; target=&quot;_blank&quot;&gt;Ciw&lt;/a&gt;&amp;nbsp;for this simulation. We&#39;ll assume we have 1,000 requests per minute, several servers pulling work from a queue, and need enough capacity such that 2 servers could be offline without the system falling behind.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/ae76f6284cd3180b182a30d44cb8f1b1.js&quot;&gt;&lt;/script&gt;&lt;/div&gt;&lt;table border=&quot;1&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;th&gt;&amp;nbsp;servers&amp;nbsp;&lt;/th&gt;&lt;th&gt;&amp;nbsp;utilization %&amp;nbsp;&lt;/th&gt;&lt;th&gt;&amp;nbsp;latency (sec)&amp;nbsp;&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;83&lt;/td&gt;&lt;td&gt;.35&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;63&lt;/td&gt;&lt;td&gt;.18&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;50&lt;/td&gt;&lt;td&gt;.15&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We can use this simulation to figure out what &quot;levers to pull&quot; to get the behavior we want. It could even become a &lt;a href=&quot;https://www.programmingopiethehokie.com/2018/07/linear-programming.html&quot; target=&quot;_blank&quot;&gt;Linear Programming&lt;/a&gt;&amp;nbsp;optimization problem. Based on the arrival rate and service rate it&#39;s easy to see we need at least 3 servers, but not as easy to see how the latency would change going from 3 to 5 or what the minimum queue size is.&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;While not always intuitive, especially with multiple queues, there are ways to understand how queue-based systems will behave. If you are interested in the topic, I suggest reading the linked resources as they go into much more detail. 
If you have other examples of queueing theory in action, please share in the comments.&lt;/p&gt;</description><link>http://www.programmingopiethehokie.com/2023/01/queueing-simulations.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-7081082265291221472</guid><pubDate>Sun, 23 Oct 2022 16:17:00 +0000</pubDate><atom:updated>2022-10-23T12:17:43.568-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">java</category><title>Java Updates Versions 9-19</title><description>&lt;p&gt;The past few years I haven&#39;t written much Java code and when I did, it was Java 8. Many projects, it seems, have stuck with Java 8, which was released back in 2014. Per the &lt;a href=&quot;https://www.oracle.com/java/technologies/java-se-support-roadmap.html&quot; target=&quot;_blank&quot;&gt;roadmap&lt;/a&gt;, Java 8 is designated as LTS, but so are Java 11 and Java 17. In fact, Java 19 is available as of last month and many interesting features have been introduced in the past 8 years. This post is an overview of what&#39;s changed: the highlights, in my opinion, so we&#39;re up-to-date. I think it&#39;s enough to be interesting but not so much that it can&#39;t be picked up quickly if you have experience with older Java versions.&lt;/p&gt;&lt;p&gt;First, some name conventions. &lt;a href=&quot;https://www.baeldung.com/java-enterprise-evolution&quot; target=&quot;_blank&quot;&gt;Java EE is now Jakarta EE&lt;/a&gt;. Definitely don&#39;t call it J2EE anymore. And since Java 11, &lt;a href=&quot;https://www.baeldung.com/oracle-jdk-vs-openjdk&quot; target=&quot;_blank&quot;&gt;Oracle JDK and OpenJDK&lt;/a&gt; are basically the same.&lt;/p&gt;&lt;h4 style=&quot;text-align: left;&quot;&gt;Tooling Updates&lt;/h4&gt;&lt;p style=&quot;text-align: left;&quot;&gt;I won&#39;t go into too much detail here.
If we are interested in using any of these they are explained and documented well elsewhere. For awareness:&lt;/p&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;Java got a REPL called &lt;a href=&quot;https://docs.oracle.com/javase/9/jshell/introduction-jshell.htm&quot; target=&quot;_blank&quot;&gt;JShell&lt;/a&gt; for learning and prototyping interactively (9)&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://www.baeldung.com/java-9-modularity&quot; target=&quot;_blank&quot;&gt;Java Module System&lt;/a&gt; as a new package abstraction (9) - my impression is that this more for the JDK itself and some libraries while OSGi continues to be applicable for regular, modular apps&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://openjdk.org/jeps/330&quot; target=&quot;_blank&quot;&gt;Single-file programs can be executed directly&lt;/a&gt; and don&#39;t need the intermediate javac step (11)&lt;/li&gt;&lt;li&gt;Multiple new garbage collection options for improved and more consistent performance (11)&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;https://www.baeldung.com/jvm-garbage-collectors#6-z-garbage-collector&quot; target=&quot;_blank&quot;&gt;ZGC&lt;/a&gt;&amp;nbsp;for low latency&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://www.baeldung.com/jvm-epsilon-gc-garbage-collector&quot; target=&quot;_blank&quot;&gt;Epsilon GC&lt;/a&gt;&amp;nbsp;no-op&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;a href=&quot;https://developers.redhat.com/articles/2022/04/19/java-17-whats-new-openjdks-container-awareness#&quot; target=&quot;_blank&quot;&gt;Linux container awareness&lt;/a&gt; for GC, thread pool sizes, etc. 
(11)&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://docs.oracle.com/en/java/javase/17/vm/class-data-sharing.html&quot; target=&quot;_blank&quot;&gt;Application class data sharing (CDS)&lt;/a&gt; for improved startup times and lower memory footprints (12) - this also seems to have &lt;a href=&quot;https://developer.ibm.com/articles/eclipse-openj9-class-sharing-in-docker-containers/&quot; target=&quot;_blank&quot;&gt;implications for apps running in containers&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;/p&gt;&lt;h4 style=&quot;text-align: left;&quot;&gt;API Updates&lt;/h4&gt;&lt;p style=&quot;text-align: left;&quot;&gt;We want to start incorporating these into our code where applicable, so I created examples to help get used to some of these updates.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;a href=&quot;https://www.baeldung.com/java-interface-private-methods&quot; target=&quot;_blank&quot;&gt;Private interface methods&lt;/a&gt; (9). Helps to encapsulate code in default methods and create more reusable code.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/0c57be2d07ad1b63ba2c78fb6ea20a4a.js&quot;&gt;&lt;/script&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Variables can have implicit types, including in lambdas, to reduce the verbosity of code (10 and 11).&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/901038cc674a0a1b9b92dd886594af88.js&quot;&gt;&lt;/script&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;a href=&quot;https://mkyong.com/java/java-13-switch-expressions/&quot; target=&quot;_blank&quot;&gt;Switch expressions&lt;/a&gt; to simplify code and prepare for pattern matching in the future (12).&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/1f05f9745924326ffdc1b54e0e998e55.js&quot;&gt;&lt;/script&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;a href=&quot;https://www.baeldung.com/java-text-blocks&quot; target=&quot;_blank&quot;&gt;Text blocks&lt;/a&gt;&amp;nbsp;as way to 
simplify code with multi-line strings (13).&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/047984b71fbca3b65affc903ff7d4a23.js&quot;&gt;&lt;/script&gt;&lt;p&gt;Simple&amp;nbsp;&lt;a href=&quot;https://www.baeldung.com/java-switch-pattern-matching&quot; target=&quot;_blank&quot;&gt;pattern matching&lt;/a&gt;&amp;nbsp;(14). I was introduced to pattern matching when programming in Scala and this seems to continue a trend of Scala features making their way, in some form, to Java. It looks like&amp;nbsp;&lt;a href=&quot;https://openjdk.org/projects/amber/&quot; target=&quot;_blank&quot;&gt;more is coming&lt;/a&gt;&amp;nbsp;in terms of pattern matching options.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/929cfd782d96b54d72e903ac52a8d100.js&quot;&gt;&lt;/script&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;a href=&quot;https://www.baeldung.com/java-record-keyword&quot; target=&quot;_blank&quot;&gt;Record keyword&lt;/a&gt; for immutable data classes (14). Getters, a public constructor, plus &lt;i&gt;equals&lt;/i&gt;, &lt;i&gt;hashCode&lt;/i&gt;, and &lt;i&gt;toString&lt;/i&gt; methods are generated automatically. Lombok is still more flexible, but this is nice for simple cases.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/cab6d6d448090daeea848ac46f19b592.js&quot;&gt;&lt;/script&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;a href=&quot;https://www.baeldung.com/java-sealed-classes-interfaces&quot; target=&quot;_blank&quot;&gt;Sealed classes&lt;/a&gt;&amp;nbsp;for fine-grained inheritance control (15). 
Super-classes that are widely accessible but not widely extensible.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/cccb69002b1135af26a751e78665dc5e.js&quot;&gt;&lt;/script&gt;&lt;div&gt;&lt;h4 style=&quot;text-align: left;&quot;&gt;Paradigm Updates&lt;/h4&gt;&lt;p style=&quot;text-align: left;&quot;&gt;For lack of a better name I&#39;ll call these paradigm updates as they relate more to programming models.&amp;nbsp;&lt;/p&gt;&lt;div&gt;&lt;a href=&quot;https://www.baeldung.com/rxjava-vs-java-flow-api&quot; target=&quot;_blank&quot;&gt;Flow API&lt;/a&gt; as an implementation of the &lt;a href=&quot;http://www.reactive-streams.org/&quot; target=&quot;_blank&quot;&gt;Reactive Streams Specification&lt;/a&gt;&amp;nbsp;(9). This seems to be a way to get the specification interfaces into the JDK but not necessarily replace libraries with better implementations.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;a href=&quot;https://openjdk.org/jeps/426&quot; target=&quot;_blank&quot;&gt;Vector API&lt;/a&gt;&amp;nbsp;introduces vectorization to Java (16). It looks like it doesn&#39;t happen automatically and requires special code, so to be more useful I think it needs to be included in a common library like NumPy in Python.&lt;p&gt;&lt;a href=&quot;https://openjdk.org/jeps/425&quot; target=&quot;_blank&quot;&gt;Virtual threads&lt;/a&gt; and &lt;a href=&quot;https://openjdk.org/jeps/428&quot; target=&quot;_blank&quot;&gt;structured concurrency&lt;/a&gt; (19). The one-to-one kernel to user thread mapping is broken enabling easier asynchronous programming. 
Read/watch &lt;a href=&quot;https://www.infoq.com/presentations/loom-java-concurrency/&quot; target=&quot;_blank&quot;&gt;Project Loom: Revolution in Java Concurrency or Obscure Implementation Detail?&lt;/a&gt;&amp;nbsp;The tl;dr is we&#39;ll still need a higher level of abstraction like reactive programming unless you want to relearn all the low-level concurrency structures.&lt;/p&gt;&lt;/div&gt;</description><link>http://www.programmingopiethehokie.com/2022/10/java-updates-versions-9-19.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-7011132372207040185</guid><pubDate>Fri, 15 Apr 2022 18:43:00 +0000</pubDate><atom:updated>2024-05-12T22:20:42.914-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">algorithms</category><category domain="http://www.blogger.com/atom/ns#">machine learning</category><category domain="http://www.blogger.com/atom/ns#">optimization</category><category domain="http://www.blogger.com/atom/ns#">python</category><category domain="http://www.blogger.com/atom/ns#">statistics</category><title>Probabilistic Graphical Models</title><description>&lt;p&gt;Deep learning and neural networks get a lot of (deserved) attention, but there is another class of ML models called Probabilistic Graphical Models (PGMs) that can also be used for inference and prediction. They have applications in fields such as medical diagnosis, image understanding, and speech recognition. Think decision making based on incomplete or insufficient knowledge.&lt;/p&gt;&lt;p&gt;More formally, PGMs use graphs to encode&amp;nbsp;&lt;a href=&quot;https://en.wikipedia.org/wiki/Joint_probability_distribution&quot; target=&quot;_blank&quot;&gt;joint probability distributions&lt;/a&gt;&amp;nbsp;as opposed to the more traditional ML approach of learning a function that directly maps input to a target variable. This post isn&#39;t a technical introduction though. 
Rather, it is more of an introduction-by-example and a summary of&amp;nbsp;&lt;a href=&quot;https://github.com/pgmpy/pgmpy_notebook/blob/master/notebooks&quot; target=&quot;_blank&quot;&gt;pgmpy&#39;s excellent notebooks&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;Given a simple graph for flower type:&lt;/p&gt;&lt;p style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTlcXd7vtAdkeqNYz3nc0DZrKlm-H4oNB8mB1acj436Ge5-awaHhXNOg14lgMm43yotK2dfRMAA1_UwO9KQHSrDe1aXbVkuiWDOyrrtKX4c_fFwXK0Re0Ah09CMHmdDoMiPtZiL14ZApTr/s435/iris.png&quot; style=&quot;clear: left; margin-bottom: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;316&quot; data-original-width=&quot;435&quot; height=&quot;145&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTlcXd7vtAdkeqNYz3nc0DZrKlm-H4oNB8mB1acj436Ge5-awaHhXNOg14lgMm43yotK2dfRMAA1_UwO9KQHSrDe1aXbVkuiWDOyrrtKX4c_fFwXK0Re0Ah09CMHmdDoMiPtZiL14ZApTr/w200-h145/iris.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Our two approaches would look something like this:&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/7ef9e0e458c1548f3c9ef1a98225f286.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;b&gt;Bayesian networks&lt;/b&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;In this section I&#39;ll use a more complex graph for student grades:&lt;/p&gt;&lt;p style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirRwtTTuZgZpnBpKsFh4RwmaMS11LhjLqObfbfN0HlWOPoPO0vMVRU2sRprzRh-bC3XctirdnqitBx0MaIVW51x5ImwcB5Unu_Et6MDeM17ihODC8ERTEGiMGODpNtCnmnQnwqLgLCmyTP/s1564/student.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;1023&quot; 
data-original-width=&quot;1564&quot; height=&quot;210&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirRwtTTuZgZpnBpKsFh4RwmaMS11LhjLqObfbfN0HlWOPoPO0vMVRU2sRprzRh-bC3XctirdnqitBx0MaIVW51x5ImwcB5Unu_Et6MDeM17ihODC8ERTEGiMGODpNtCnmnQnwqLgLCmyTP/w320-h210/student.png&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;For problems with many features and/or high cardinality features, inference will be difficult because the size of the joint probability distribution increases exponentially. PGMs can compactly represent it by exploiting&amp;nbsp;&lt;a href=&quot;https://en.wikipedia.org/wiki/Conditional_independence&quot; target=&quot;_blank&quot;&gt;conditional independence&lt;/a&gt;.&amp;nbsp;They provide us efficient methods for doing inference over these joint distributions.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;In this graph we have cardinalities of 2 for each node except Letter which is 3. The joint distribution would require storing 48 values (2*2*2*2*3) while the PGM only requires 26 (see notebook 1 for details).&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/14bb8eb43a0f7799381d5ef5bdd8294e.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;This is what&#39;s known as a &lt;a href=&quot;https://en.wikipedia.org/wiki/Bayesian_network&quot; target=&quot;_blank&quot;&gt;Bayesian network&lt;/a&gt;, which is always represented as a directed acyclic graph. Each node is parameterized by a &lt;a href=&quot;https://en.wikipedia.org/wiki/Conditional_probability_distribution&quot; target=&quot;_blank&quot;&gt;conditional probability distribution&lt;/a&gt; (CPD) like P(node|parents). For example, the Grade node has the CPD P(G|D,I). 
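The compactness claim above (48 values versus 26) can be checked with a quick parameter count. Here is a sketch on a hypothetical five-node chain rather than the student network (whose exact CPD sizes are in the notebooks), so the numbers differ, but the idea is the same:

```python
from math import prod

def factored_size(card, parents):
    """Values stored when the joint factorizes into one CPD per node:
    card(node) * product of the parents' cardinalities."""
    return sum(card[n] * prod(card[p] for p in parents[n]) for n in card)

# Hypothetical chain X1 -> X2 -> X3 -> X4 -> X5, all binary.
card = {"X1": 2, "X2": 2, "X3": 2, "X4": 2, "X5": 2}
parents = {"X1": [], "X2": ["X1"], "X3": ["X2"], "X4": ["X3"], "X5": ["X4"]}

joint_size = prod(card.values())         # full joint table: 2**5 = 32
cpd_size = factored_size(card, parents)  # CPD tables: 2 + 4 + 4 + 4 + 4 = 18
```

The gap widens quickly: doubling the chain length squares the size of the joint table while the CPD total only grows linearly.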
Bayesian networks are used when you want to represent causal relationships between random variables.&amp;nbsp;&lt;a href=&quot;https://en.wikipedia.org/wiki/Naive_Bayes_classifier&quot; target=&quot;_blank&quot;&gt;Naive Bayes&lt;/a&gt;&amp;nbsp;is a special case where the features are assumed to be conditionally independent of each other given the target variable, so each feature node connects only to the target.&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzVOLotwq817DnicVNfDlmpsFcgbKDjOESCV53SVMCQtUlYraplJWJBMwNnfqDFHHJ9n_kCW_cZ-QR1FezD89f222xHkhHRqftJkrUKl5JFX3C88n-Vb9q8rKKCVLqy-FeBWuz006qTRtf/s1111/cpds.png&quot; style=&quot;margin-left: 1em; margin-right: 1em; text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;716&quot; data-original-width=&quot;1111&quot; height=&quot;258&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzVOLotwq817DnicVNfDlmpsFcgbKDjOESCV53SVMCQtUlYraplJWJBMwNnfqDFHHJ9n_kCW_cZ-QR1FezD89f222xHkhHRqftJkrUKl5JFX3C88n-Vb9q8rKKCVLqy-FeBWuz006qTRtf/w400-h258/cpds.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Given tabular data and a graph structure, CPDs can be estimated using &lt;a href=&quot;https://en.wikipedia.org/wiki/Maximum_likelihood_estimation&quot; target=&quot;_blank&quot;&gt;Maximum Likelihood Estimation&lt;/a&gt; (MLE). It&#39;s similar to what was done with the Iris data in the first code block above. It&#39;s also fragile because it is so dependent on the amount and quality of the observed data (see notebook 10 for details). This explains why that code breaks with some random seeds.&amp;nbsp;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;A better solution is Bayesian Parameter Estimation.
There you start with CPDs based on your prior beliefs (or uniform priors) and update them based on the observed data.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/93a825a0aaaa7eb222a1e2baca824c91.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;One method of exact inference in PGMs is variable elimination. It efficiently avoids computing the entire joint probability distribution (see notebooks 2 and 5 for details). For larger graphs there are other, approximate algorithms because an exact solution would be intractable.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/10d7759ae548730fe3e264cea67394fa.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Making predictions is similar. Instead of getting a distribution, we get the most probable state.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/f55b4838b85846eadc65cbd06cfc869b.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;b&gt;Markov networks&lt;/b&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Markov_random_field&quot; target=&quot;_blank&quot;&gt;Markov networks&lt;/a&gt; are represented by undirected graphs.&amp;nbsp;They represent non-causal relationships. They can, however, represent dependencies that a Bayesian model can&#39;t, like cycles and bi-directional dependencies. Factors describe connected variable affinity, or how much two nodes agree with each other.
The joint probability distribution is the product of all factors.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/6f8ff78421ccc70fb420ac044c40ad1a.js&quot;&gt;&lt;/script&gt;&lt;p&gt;&lt;i&gt;A quick note because the names sound similar.&amp;nbsp;&lt;a href=&quot;https://en.wikipedia.org/wiki/Markov_chain&quot; target=&quot;_blank&quot;&gt;Markov chains&lt;/a&gt;&amp;nbsp;are not PGMs because the nodes are not random variables. They can, however, be represented as Bayesian networks, which makes the PGM algorithms available.&lt;/i&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;b&gt;Sampling&lt;/b&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Sampling algorithms approximate exact inference by generating a large number of samples that will converge to the original distribution. One of these is Hamiltonian Monte Carlo. It is a&amp;nbsp;Markov chain Monte Carlo&amp;nbsp;(MCMC) algorithm that proposes future states in the Markov chain using Hamiltonian dynamics from physics (see notebook 8 for details). Other MCMC algorithms you may encounter are &lt;a href=&quot;https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm&quot; target=&quot;_blank&quot;&gt;Metropolis-Hastings&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Gibbs_sampling&quot; target=&quot;_blank&quot;&gt;Gibbs sampling&lt;/a&gt;.
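To make the MCMC idea concrete, here is a minimal random-walk Metropolis-Hastings sampler in plain Python. The standard-normal target and the step size are assumptions for illustration; pgmpy and the probabilistic programming libraries mentioned below provide real implementations.

```python
import math
import random

def metropolis_hastings(log_p, n, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis-Hastings: propose x' = x + Normal(0, step)
    and accept with probability min(1, p(x') / p(x))."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n):
        proposal = x + rng.gauss(0.0, step)
        # Symmetric proposal, so the acceptance ratio is just p(x') / p(x).
        if math.log(rng.random()) < log_p(proposal) - log_p(x):
            x = proposal
        samples.append(x)  # on rejection the chain repeats the current state
    return samples

# Target: a standard normal, via its log density up to a constant.
chain = metropolis_hastings(lambda x: -0.5 * x * x, 20000)
kept = chain[5000:]  # discard burn-in
mean = sum(kept) / len(kept)
```

With enough samples the empirical mean and variance of `kept` converge to the target's 0 and 1, even though we only ever evaluated an unnormalized density.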
See&amp;nbsp;&lt;a href=&quot;https://towardsdatascience.com/monte-carlo-approximation-methods-which-one-should-you-choose-and-when-886a379fb6b&quot; target=&quot;_blank&quot;&gt;Monte Carlo Approximation Methods: Which one should you choose and when?&lt;/a&gt;&amp;nbsp;for a comparison of these methods.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/364c76d85ea9735404d13c7597a83a42.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Another interesting find that fits in at this point is the&amp;nbsp;&lt;a href=&quot;https://docs.pymc.io/&quot; target=&quot;_blank&quot;&gt;PyMC3&lt;/a&gt;&amp;nbsp;library and the&amp;nbsp;&lt;a href=&quot;https://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/&quot; target=&quot;_blank&quot;&gt;Probabilistic Programming and Bayesian Methods for Hackers&lt;/a&gt;&amp;nbsp;open source book.&amp;nbsp;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;I also think this is a nice writeup on&amp;nbsp;&lt;a href=&quot;https://towardsdatascience.com/bayesian-logistic-regression-in-python-9fae6e6e3e6a&quot; target=&quot;_blank&quot;&gt;Bayesian Logistic Regression&lt;/a&gt;&amp;nbsp;using&amp;nbsp;&lt;a href=&quot;https://pyro.ai/&quot; target=&quot;_blank&quot;&gt;Pyro&lt;/a&gt;, another probabilistic programming library, and MCMC.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;b&gt;Learning networks&lt;/b&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Learning a Bayesian network can be done as an optimization problem by scoring networks on how well they fit a data set, and searching through the space of all possible models. 
For non-trivial graphs where an exhaustive search is not possible,&amp;nbsp;&lt;a href=&quot;https://en.wikipedia.org/wiki/Hill_climbing&quot; target=&quot;_blank&quot;&gt;hill climbing&lt;/a&gt;&amp;nbsp;can be used (see notebook 11 for details).&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/dbc00c4dbe97578fe0fb9a13ec40d3e6.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;b&gt;Wrap-up&lt;/b&gt;&lt;/p&gt;&lt;div&gt;I&#39;ve only scratched the surface here, but I think it&#39;s a more intuitive introduction to the topic than most of the material in this space. We could build up to more complex graphs and problems from here.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href=&quot;https://towardsdatascience.com/hidden-markov-models-explained-with-a-real-life-example-and-python-code-2df2a7956d65&quot; target=&quot;_blank&quot;&gt;Hidden Markov Models Explained with a Real Life Example and Python code&lt;/a&gt;&amp;nbsp;is a nice extension, for example.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;And to bring things full circle on where PGMs fit in to the ML landscape, here is an &lt;a href=&quot;https://quorasessionwithiangoodfellow.quora.com/Given-the-rise-of-Neural-Networks-what-is-the-future-of-probabilistic-graphical-models-1&quot; target=&quot;_blank&quot;&gt;opinion from well-known ML researcher Ian Goodfellow&lt;/a&gt;:&lt;/span&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;div&gt;&lt;span style=&quot;background-color: white;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;color: #282829;&quot;&gt;The two aren’t mutually exclusive. Most applications of neural nets can be considered graphical models that use neural nets to provide some of the conditional probability distributions. 
You could argue that the graphical model perspective is growing less useful because so many recent neural models have such simple graph structure &lt;/span&gt;&lt;span style=&quot;color: #282829;&quot;&gt;…&lt;/span&gt;&lt;span style=&quot;color: #282829;&quot;&gt;&amp;nbsp;These graphs are not very structured compared to neural models that were popular a few years ago like …&amp;nbsp;But there are some recent models that make a little bit of use of graph structure, like VAEs with auxiliary variables.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/blockquote&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Plus a tweet from the Stanford NLP group:&lt;/p&gt;&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p dir=&quot;ltr&quot; lang=&quot;en&quot; style=&quot;text-align: left;&quot;&gt;There is increasing convergence between this decade’s neural models and last decade’s probabilistic graphical models… &lt;a href=&quot;https://t.co/MKlIP4Sayx&quot;&gt;https://t.co/MKlIP4Sayx&lt;/a&gt;&lt;/p&gt;&lt;div style=&quot;text-align: left;&quot;&gt;— Stanford NLP Group (@stanfordnlp) &lt;a href=&quot;https://twitter.com/stanfordnlp/status/996449546497441792?ref_src=twsrc%5Etfw&quot;&gt;May 15, 2018&lt;/a&gt;&lt;/div&gt;&lt;/blockquote&gt; &lt;script async=&quot;&quot; charset=&quot;utf-8&quot; src=&quot;https://platform.twitter.com/widgets.js&quot;&gt;&lt;/script&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Thus it would seem that knowing these concepts will continue to be useful even if we don&#39;t use PGMs directly or exclusively.&lt;/p&gt;</description><link>http://www.programmingopiethehokie.com/2022/04/probabilistic-graphical-models.html</link><author>noreply@blogger.com (opiethehokie)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTlcXd7vtAdkeqNYz3nc0DZrKlm-H4oNB8mB1acj436Ge5-awaHhXNOg14lgMm43yotK2dfRMAA1_UwO9KQHSrDe1aXbVkuiWDOyrrtKX4c_fFwXK0Re0Ah09CMHmdDoMiPtZiL14ZApTr/s72-w200-h145-c/iris.png" height="72" width="72"/></item>
<item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-235040763904237946</guid><pubDate>Fri, 19 Feb 2021 13:08:00 +0000</pubDate><atom:updated>2025-07-09T20:58:38.378-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">big data</category><category domain="http://www.blogger.com/atom/ns#">machine learning</category><category domain="http://www.blogger.com/atom/ns#">parallelism</category><category domain="http://www.blogger.com/atom/ns#">python</category><title>ML at Scale Part 3: distributed compute</title><description>&lt;p&gt;In &lt;a href=&quot;http://www.programmingopiethehokie.com/2021/02/ml-at-scale-part-2-memory.html&quot; target=&quot;_blank&quot;&gt;part 2&lt;/a&gt; I focused on ML when your data won&#39;t fit in memory. This post will move on to slow, compute-bound ML instead. I&#39;ll continue to use Dask and explore how it can help us.&lt;/p&gt;&lt;p&gt;Dask leverages multiple CPU cores to enable efficient parallel computation on a single machine. It can also run on a thousand-machine cluster, breaking up computations and routing them efficiently. The &lt;a href=&quot;https://docs.dask.org/en/latest/spark.html&quot; target=&quot;_blank&quot;&gt;Comparison to Spark&lt;/a&gt; documentation is a great reference for understanding Dask in the context of an older tool and that older post.
Perhaps the most interesting difference is that Spark is essentially an extension of the MapReduce paradigm, while Dask is built on generic task scheduling, which lets it implement more sophisticated algorithms.&lt;/p&gt;&lt;p&gt;Sticking with the scikit-learn examples, here is replacing a parallel algorithm&#39;s &lt;a href=&quot;https://joblib.readthedocs.io/en/latest/&quot; target=&quot;_blank&quot;&gt;Joblib&lt;/a&gt;&amp;nbsp;backend with Dask to (potentially) spread work out across a cluster.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/d89173366923ee33f4b7c73494f67fba.js&quot;&gt;&lt;/script&gt;&lt;p&gt;The Dask documentation is quick to point out in their &lt;a href=&quot;https://docs.dask.org/en/latest/best-practices.html&quot; target=&quot;_blank&quot;&gt;best practices&lt;/a&gt; that not everyone needs distributed ML as it has some overhead. Compiling code with Numba or Cython could help, as could intelligently sampling some of your data. In &lt;a href=&quot;https://www.programmingopiethehokie.com/2017/02/machine-learning-for-ncaa-basketball.html&quot; target=&quot;_blank&quot;&gt;this post&lt;/a&gt; I got huge speedups by vectorizing some code that was looping through large matrices.&lt;/p&gt;&lt;p&gt;Across this three-part series we&#39;ve now seen how to speed up reading large datasets, work with datasets that don&#39;t fit entirely in memory, and distribute processing across multiple machines. There&#39;s obviously a lot more to this, but I wanted to develop a better intuition for how to approach these types of issues and at least know where to start. Hopefully you learned something too.&lt;/p&gt;&lt;p&gt;UPDATE:&lt;/p&gt;&lt;p&gt;A specific tool isn&#39;t supposed to be the focus of this post. Dask was used here to illustrate the idea and show how simple it can be, but there are other options in this space.
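The reason the backend swap is nearly free is that the units of work are independent. A stand-in sketch using only the standard library; the toy `evaluate` function and hyperparameter grid are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(params):
    """Stand-in for fitting a model and returning a validation score;
    a real version would train an estimator with these hyperparameters."""
    c, gamma = params
    return -((c - 1.0) ** 2 + (gamma - 0.1) ** 2)

grid = [(c, g) for c in (0.1, 1.0, 10.0) for g in (0.01, 0.1, 1.0)]

# Each candidate is scored independently, so the executor can be local
# threads, local processes, or a cluster scheduler -- the surrounding
# code doesn't change. That's the idea behind swapping Joblib's backend
# for Dask.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, grid))

best = grid[max(range(len(grid)), key=scores.__getitem__)]
```

The same shape of code works for cross-validation folds, ensemble members, or any other embarrassingly parallel training loop.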
Here are a couple of other examples that I&#39;ve come across:&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;&lt;a href=&quot;https://github.com/ray-project/tune-sklearn&quot; target=&quot;_blank&quot;&gt;tune-sklearn&lt;/a&gt;, which is part of the &lt;a href=&quot;https://docs.ray.io/en/master/&quot; target=&quot;_blank&quot;&gt;Ray&lt;/a&gt; framework for building distributed applications, can do distributed hyperparameter tuning&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://www.tensorflow.org/guide/distributed_training&quot; target=&quot;_blank&quot;&gt;Distributed training with TensorFlow&lt;/a&gt;&amp;nbsp;shows how you can distribute model training within the TensorFlow ecosystem&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;p&gt;UPDATE 2:&lt;/p&gt;&lt;p&gt;Both&amp;nbsp;&lt;a href=&quot;https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html&quot; target=&quot;_blank&quot;&gt;Pandas 2.0&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href=&quot;https://www.pola.rs/&quot; target=&quot;_blank&quot;&gt;Polars&lt;/a&gt;&amp;nbsp;now use Apache Arrow as a memory model. Polars, a relatively new entrant to this space, is implemented in Rust and exposes a Python API. It is created specifically for fast data processing, not ML, but overlaps enough with this series that it is worth checking out.&lt;/p&gt;&lt;p&gt;UPDATE 3:&lt;/p&gt;&lt;p&gt;Since I mentioned Numba earlier, I&#39;ll also mention &lt;a href=&quot;https://github.com/jax-ml/jax&quot; target=&quot;_blank&quot;&gt;JAX&lt;/a&gt;, another, newer library in that space. JAX has a NumPy-like interface, works with GPUs, offers JIT compilation, and supports automatic differentiation, vectorization, and parallelization. Check out their&amp;nbsp;&lt;a href=&quot;https://docs.jax.dev/en/latest/notebooks/Neural_Network_and_Data_Loading.html#&quot; target=&quot;_blank&quot;&gt;simple NN example&lt;/a&gt; to see it in action.
It&#39;s not replacing distributed compute or even competing with Dask; the point is more that there are now a lot of amazing tools that make working with data easier and faster.&lt;/p&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;</description><link>http://www.programmingopiethehokie.com/2021/02/ml-at-scale-part-3-distributed-compute.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-1821860980486358031</guid><pubDate>Mon, 15 Feb 2021 22:28:00 +0000</pubDate><atom:updated>2021-12-30T11:10:50.714-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">big data</category><category domain="http://www.blogger.com/atom/ns#">machine learning</category><category domain="http://www.blogger.com/atom/ns#">python</category><title>ML at Scale Part 2: memory</title><description>&lt;p&gt;When data we want to train a machine learning model on becomes too big to fit in memory, we need to find a way to work on subsets of the data. I hinted at this in &lt;a href=&quot;https://www.programmingopiethehokie.com/2021/02/ml-at-scale-part-1-io.html&quot; target=&quot;_blank&quot;&gt;part 1&lt;/a&gt;. We can use Pandas to read chunks of a file, but that is fairly primitive and slow.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/b13ab626ff99cb1a527a673180e9da96.js&quot;&gt;&lt;/script&gt;&lt;p&gt;Libraries like &lt;a href=&quot;https://github.com/vaexio/vaex&quot; target=&quot;_blank&quot;&gt;Vaex&lt;/a&gt; and&lt;a href=&quot;https://dask.org/&quot; target=&quot;_blank&quot;&gt; Dask&lt;/a&gt; attempt to abstract this away.&amp;nbsp;&lt;/p&gt;&lt;p&gt;Vaex provides lazy, out-of-core (not all in memory at once) DataFrames via memory-mapping. Pre-processing and feature engineering are more efficient, and memory is freed up for model training.
It also has a &lt;a href=&quot;https://vaex.io/docs/tutorial_ml.html&quot;&gt;vaex.ml&lt;/a&gt; package which provides a scikit-learn wrapper.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/878cc1e19f84785e09665975a4052490.js&quot;&gt;&lt;/script&gt;&lt;p&gt;Dask provides large, parallel DataFrames composed of smaller Pandas DataFrames. This helps with data too big to fit in memory because the individual Pandas DataFrames can be stored on disk. &lt;a href=&quot;https://ml.dask.org/index.html&quot; target=&quot;_blank&quot;&gt;DaskML&lt;/a&gt; provides estimators designed to work with Dask DataFrames.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/44ff311da0dbc2d772907a2143bd9a54.js&quot;&gt;&lt;/script&gt;&lt;p&gt;The accuracies both came out to 96%. Similar ideas, different implementation. In fact, these DataFrames remind me a little bit of the persistent data structures covered in my &lt;a href=&quot;http://www.programmingopiethehokie.com/2014/03/exploring-immutability.html&quot; target=&quot;_blank&quot;&gt;Exploring Immutability&lt;/a&gt; post.&lt;/p&gt;&lt;p&gt;Even with these fancy DataFrames, many machine learning algorithms are designed to train on all the data at once. If our data is too big to fit in memory then that&#39;s going to be a problem. As in the code above, we need to use online learning or &lt;a href=&quot;https://scikit-learn.org/0.15/modules/scaling_strategies.html#incremental-learning&quot; target=&quot;_blank&quot;&gt;incremental algorithms&lt;/a&gt;&amp;nbsp;to solve this problem.
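A minimal sketch of the online-learning idea in plain Python: logistic regression trained one mini-batch at a time, so only the current batch is ever in memory. The synthetic data and learning rate are assumptions for illustration; DaskML and scikit-learn's partial_fit-style estimators do this for real.

```python
import math
import random

def partial_fit(w, batch, lr=0.5):
    """One SGD pass over a single mini-batch of (features, label) pairs --
    the core of incremental learners exposed via partial_fit-style APIs."""
    for x, y in batch:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
        for i, xi in enumerate(x):
            w[i] -= lr * (p - y) * xi   # gradient step for one example
    return w

rng = random.Random(0)

def make_batch(n=50):
    """Synthetic stream: label is 1 when x1 + x2 > 0; the constant 1.0
    feature folds the bias term into the weights."""
    xs = [(rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0) for _ in range(n)]
    return [(x, 1 if x[0] + x[1] > 0 else 0) for x in xs]

w = [0.0, 0.0, 0.0]
for _ in range(40):  # 40 batches "streamed" in, never all in memory at once
    w = partial_fit(w, make_batch())

test = make_batch(200)
accuracy = sum(
    (sum(wi * xi for wi, xi in zip(w, x)) > 0) == (y == 1) for x, y in test
) / len(test)
```

Nothing about the update rule changes whether the batches come from a list, a file on disk, or a network stream, which is why this pattern scales past memory limits.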
The &lt;i&gt;Incremental&lt;/i&gt; and &lt;i&gt;IncrementalPredictor&lt;/i&gt;&amp;nbsp;classes handle &quot;streaming&quot; the data (in batches) and we specify upfront the possible classes.&lt;/p&gt;&lt;p&gt;DaskML adds several &lt;a href=&quot;https://ml.dask.org/glm.html&quot; target=&quot;_blank&quot;&gt;generalized linear model&lt;/a&gt; implementations.&amp;nbsp;&lt;a href=&quot;https://scikit-multiflow.github.io/&quot; target=&quot;_blank&quot;&gt;scikit-multiflow&lt;/a&gt; is designed for actual streaming data and adds several other online learning algorithms. Neural networks are also trained in this manner, often being fed &quot;mini-batches&quot;, so they are good candidates for datasets that don&#39;t fit in memory as well.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/893a9f984d16b678dda4f777dd2d68b6.js&quot;&gt;&lt;/script&gt;&lt;p&gt;Stay tuned for part 3 where I&#39;ll get into being compute-constrained instead of memory-constrained.&lt;/p&gt;&lt;p&gt;UPDATE:&lt;/p&gt;&lt;p&gt;There&#39;s a lot of I/O and memory-related work coming out of the TensorFlow community as well. 
Check out &lt;a href=&quot;https://www.tensorflow.org/guide/data_performance?utm_source=pocket_mylist#overview&quot; target=&quot;_blank&quot;&gt;Better performance with the tf.data API&lt;/a&gt; as it overlaps nicely with Parts 1 and 2 of this series.&lt;/p&gt;</description><link>http://www.programmingopiethehokie.com/2021/02/ml-at-scale-part-2-memory.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-5180214187342057319</guid><pubDate>Sun, 07 Feb 2021 19:48:00 +0000</pubDate><atom:updated>2021-12-10T15:30:34.348-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">big data</category><category domain="http://www.blogger.com/atom/ns#">machine learning</category><category domain="http://www.blogger.com/atom/ns#">python</category><title>ML at Scale Part 1: I/O</title><description>&lt;p&gt;I recently came across a somewhat large dataset from a Kaggle competition where the data was provided as an approximately 6 GB CSV file. A frequent comment in the discussion forum was how long it took just to read this file. A few GBs is large enough where this starts to become noticeable, but it&#39;s not really that big. If a laptop can have a 1+ TB drive and 32 GB memory then this isn&#39;t even in the realm of &quot;big data&quot;. That&#39;s good, though, because it means there are some simple tricks we can use to cut down on that read time.&lt;/p&gt;&lt;p&gt;The pandas read_csv() method takes about 63 seconds for me. That&#39;s our baseline.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/d858442828b82ebadf269fa42fc3abb6.js&quot;&gt;&lt;/script&gt;&lt;p&gt;First we try reducing the precision. This gets us to 59 seconds. Not great.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/2627d15f07eada59f6caa1beec2c9801.js&quot;&gt;&lt;/script&gt;&lt;p&gt;Next we try reading the file in chunks.
This is actually slower, but the technique could help us if the file didn&#39;t fit entirely in memory. More on that in future posts.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/b13ab626ff99cb1a527a673180e9da96.js&quot;&gt;&lt;/script&gt;&lt;p&gt;Then we try &lt;a href=&quot;https://dask.org/&quot; target=&quot;_blank&quot;&gt;Dask&lt;/a&gt;, which spreads the work across multiple processors. 30 seconds. Better, but I still don&#39;t want to wait that long. More on Dask in future posts as well.&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/7e11019ba1307aa6456da99b307b3623.js&quot;&gt;&lt;/script&gt;&lt;p&gt;Finally we convert the CSV file to a different format. I tried &lt;a href=&quot;https://arrow.apache.org/docs/python/parquet.html&quot; target=&quot;_blank&quot;&gt;Apache Parquet&lt;/a&gt; but there are others. It&#39;s a binary columnar format (remember &lt;a href=&quot;http://www.programmingopiethehokie.com/2020/06/column-oriented-database-basics.html&quot; target=&quot;_blank&quot;&gt;Column-oriented Database Basics&lt;/a&gt;?). Stored in this manner the data is 2.5 GB. And this gets us to just 3 seconds for reading the whole file!&lt;/p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/c32a1551236369b6694a245499ac9eec.js&quot;&gt;&lt;/script&gt;&lt;p&gt;Converting our data to the binary file format and possibly reducing the precision or using Dask as well would really shorten our feedback loop while training an ML model.
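For reference, the chunked-reading pattern mentioned above is simple enough to sketch with just the standard library; a tiny in-memory CSV stands in for the 6 GB file:

```python
import csv
import io
from itertools import islice

def chunks(reader, size):
    """Yield successive lists of up to `size` rows -- the same shape of
    iteration as pandas.read_csv(..., chunksize=size)."""
    while chunk := list(islice(reader, size)):
        yield chunk

# Stand-in for a file too big to load all at once.
data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
for chunk in chunks(csv.DictReader(data), size=4):  # chunks of 4, 4, 2 rows
    total += sum(int(row["value"]) for row in chunk)  # running aggregate
```

Because only one chunk is materialized at a time, peak memory is bounded by the chunk size rather than the file size.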
When the data is this size, it would seem that any cleaning or preprocessing we can do ahead of time makes sense to do once, before converting the file format.&lt;/p&gt;&lt;p&gt;UPDATE:&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://arrow.apache.org/docs/index.html&quot; target=&quot;_blank&quot;&gt;Apache Arrow&lt;/a&gt; is another project to check out in this space along with &lt;a href=&quot;https://en.wikipedia.org/wiki/Memory-mapped_file&quot; target=&quot;_blank&quot;&gt;memory-mapped files&lt;/a&gt;. Reading this data in the Feather file format is even faster than Parquet.&lt;/p&gt;</description><link>http://www.programmingopiethehokie.com/2021/02/ml-at-scale-part-1-io.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-1369398867423112900</guid><pubDate>Sun, 27 Dec 2020 02:20:00 +0000</pubDate><atom:updated>2023-07-09T09:53:39.628-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">bindings</category><category domain="http://www.blogger.com/atom/ns#">javascript</category><category domain="http://www.blogger.com/atom/ns#">rust</category><title>Rust to WebAssembly and Fibonacci Numbers</title><description>&lt;p&gt;&lt;a href=&quot;https://developer.mozilla.org/en-US/docs/WebAssembly/Concepts&quot; target=&quot;_blank&quot;&gt;WebAssembly&lt;/a&gt; (wasm) is a virtual assembly language. It&#39;s a binary instruction format that can be executed at near-native speeds by JavaScript engines like V8, which means it can run in a browser or Node.js app. Not intended as a JavaScript replacement, it instead works with it for performance-critical pieces of code. You can make calls from JavaScript to WebAssembly and vice versa.
It can be useful for games, VR, and AR, for example.&amp;nbsp;&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://www.rust-lang.org/&quot; target=&quot;_blank&quot;&gt;Rust&lt;/a&gt; is a fast and memory-efficient programming language with interesting type and thread-safety characteristics. I&#39;ve never used it, but have wanted to check it out for a while. Along with Go and C, it can easily be compiled into WebAssembly. This post just touches the surface of Rust.&lt;/p&gt;&lt;p&gt;First, to learn how to use Rust in a JavaScript app, I followed &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/WebAssembly/Rust_to_wasm&quot; target=&quot;_blank&quot;&gt;Compiling from Rust to WebAssembly&lt;/a&gt;. They guide you through compiling Rust code, generating a JavaScript wrapper package, then using npm and webpack to run it. The end user only needs a modern browser and they can execute the code originally written in Rust none the wiser.&lt;/p&gt;&lt;p&gt;Building on that, as a simple experiment to see how much faster Rust code compiled to WebAssembly can be versus vanilla JavaScript, I compared calculating the nth Fibonacci number in each. I used the inefficient recursive algorithm intentionally, wanting it to be slow.&lt;/p&gt;&lt;p&gt;The function in Rust looks like this:&lt;/p&gt;&lt;p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/1e20fe003dd7b8bc1c332077341aede9.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;&lt;p&gt;And the JavaScript code that calls it, times it, and compares it:&lt;/p&gt;&lt;p&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/fc2bcac26d18c8a89a580715296f0d79.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;&lt;p&gt;Surprisingly, although maybe it shouldn&#39;t be, the WebAssembly version of this simple function is orders of magnitude faster when executed in the browser. 
At n=35, where they really start to diverge, I&#39;m seeing about 0.06 seconds versus 3.5 seconds.&amp;nbsp;&lt;/p&gt;&lt;p&gt;I don&#39;t often have to write CPU-intensive code like this in JavaScript, but with such a stark performance difference possible it&#39;s a good tool to have in the toolbox.&amp;nbsp;&lt;/p&gt;</description><link>http://www.programmingopiethehokie.com/2020/12/rust-to-webassembly-and-fibonacci.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-1924246420871301413</guid><pubDate>Wed, 04 Nov 2020 22:43:00 +0000</pubDate><atom:updated>2020-11-04T17:43:13.433-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">APIs</category><category domain="http://www.blogger.com/atom/ns#">cloud</category><category domain="http://www.blogger.com/atom/ns#">nodejs</category><title>GraphQL API Gateway Prototype</title><description>&lt;p&gt;I&#39;ve been meaning to check out &lt;a href=&quot;https://graphql.org/&quot; target=&quot;_blank&quot;&gt;GraphQL&lt;/a&gt; for a while. At work I&#39;m seeing calls to REST APIs being made to get one small piece of data, most of the response discarded. Or, multiple HTTP requests, usually sequential, whose responses must be combined to do something useful with the retrieved data. Their granularity is wrong for the use case, but maybe not for someone else&#39;s. Can GraphQL help with this? It seems like it can. Model the business domain as a graph, like our mental models and object-oriented programming. Engines for many languages. A &lt;a href=&quot;https://graphql.org/learn/schema/&quot; target=&quot;_blank&quot;&gt;type system&lt;/a&gt; to enable good developer tools. It sounds great.&lt;/p&gt;&lt;p&gt;At the risk of adding another network hop, another moving part, another layer, it would be nice to be able to call something to get exactly the data I need without changing the existing APIs.
Mobile devices or slow internet connections could benefit from the reduced number of round trips. This extra layer could abstract away the different (read: poor) uses of HTTP status codes and different response body styles. What I think I really want is a GraphQL API gateway.&lt;/p&gt;&lt;p&gt;The &lt;a href=&quot;https://microservices.io/patterns/apigateway.html&quot; target=&quot;_blank&quot;&gt;API gateway&lt;/a&gt;&amp;nbsp;is a single entry point for all clients (the backends for frontends variation of the pattern has a different gateway optimized for each type of client). Requests can be proxied straight through to a single microservice or fanned out to several microservices. Responses can be aggregated and/or modified. This is the overlap point with GraphQL and why I think they would go well together. The API gateway can also centralize several cross-cutting concerns like throttling, routing, circuit-breaking, input validation, authentication (authorization stays in the business logic), etc.&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://www.apollographql.com/&quot; target=&quot;_blank&quot;&gt;Apollo&lt;/a&gt; is the GraphQL implementation I went with to prototype this. From &lt;a href=&quot;https://www.apollographql.com/docs/tutorial/introduction/&quot; target=&quot;_blank&quot;&gt;their tutorial&lt;/a&gt; I started with the rocket launch API and extended it with a made-up weather API to see how the two could be chained together. To the client, it&#39;s seamless. Both &quot;services&quot; are part of the same graph.&lt;/p&gt;&lt;p&gt;This is a lot of code to show in one shot, but I&#39;ll explain it below and then show some example GraphQL queries.&lt;/p&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/9e0e44790c16af5d417ee54baacafe0a.js&quot;&gt;&lt;/script&gt;
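Before walking through the gist, here is a rough model of the key idea in plain Python rather than Apollo: field resolvers receive their parent's result, so a Weather field can be resolved from the site of its parent Launch. All names and data here are made up for illustration.

```python
# Hypothetical, minimal model of resolver chaining in a GraphQL
# gateway (plain Python, not Apollo; all names are illustrative).

def fetch_launch(launch_id):
    # stand-in for a call to the rocket launch REST API
    return {"id": launch_id, "site": "KSC LC 39A"}

def fetch_weather(site):
    # stand-in for a call to the weather REST API
    return {"site": site, "forecast": "sunny"}

# resolvers keyed by (type, field), mirroring a GraphQL resolver map;
# each resolver gets the parent object's result plus field arguments
resolvers = {
    ("Query", "launch"): lambda parent, args: fetch_launch(args["id"]),
    ("Launch", "weather"): lambda parent, args: fetch_weather(parent["site"]),
}

def resolve(type_name, field, parent=None, args=None):
    return resolvers[(type_name, field)](parent, args or {})

# resolving something like { launch(id: 42) { weather { forecast } } }
launch = resolve("Query", "launch", args={"id": 42})
weather = resolve("Launch", "weather", parent=launch)
```

The point of the sketch is only the chaining: the weather resolver never makes its own decision about which site to use, it inherits it from the launch that was resolved above it.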
&lt;p&gt;First, &lt;i&gt;typeDefs&lt;/i&gt; defines the schema. You can query a list of rocket launches or a single launch by ID. &lt;i&gt;dataSources&lt;/i&gt; specifies, of course, where the data is coming from, like a database or REST API. &lt;i&gt;resolvers&lt;/i&gt; stitches these together. Notice how the Weather type takes a site from its parent, a Launch. This is how they are linked, or chained together.&lt;/p&gt;&lt;p&gt;With the Apollo server running you can try it out at http://localhost:4000/ in a browser. The GraphQL queries are on the left and the responses are on the right.&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivh2E6BvNZr_ZuFj2V2wanBsxpkZrmf6-qKZ1rrBvDZG0jdOFMW2W6ZeIFOumJlHtmCZ0Jv7AR8EsJDJyd_pIvGx7MGv9Wxa-ZijVg8efMSKzvlQCYaWrv105AfZFfyEtBWJAEsf1qD9xF/s821/getLaunchById.PNG&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;277&quot; data-original-width=&quot;821&quot; height=&quot;216&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivh2E6BvNZr_ZuFj2V2wanBsxpkZrmf6-qKZ1rrBvDZG0jdOFMW2W6ZeIFOumJlHtmCZ0Jv7AR8EsJDJyd_pIvGx7MGv9Wxa-ZijVg8efMSKzvlQCYaWrv105AfZFfyEtBWJAEsf1qD9xF/w640-h216/getLaunchById.PNG&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4O_p3JDPIci83AFZYvuZ8IZEY6w8jGMvx5IEtBlCQa6Ryyfozem0wPZzKKL4JjcWfbGyEut_-ixEZ1O8Yr1dZCQ-7JuYTtLGzjXUCo3hpHFQyM-Z2q1dh8xMHtt00CxLQN5Hr7ADnAJyE/s1011/getLaunches.PNG&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;666&quot; data-original-width=&quot;1011&quot; 
height=&quot;422&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4O_p3JDPIci83AFZYvuZ8IZEY6w8jGMvx5IEtBlCQa6Ryyfozem0wPZzKKL4JjcWfbGyEut_-ixEZ1O8Yr1dZCQ-7JuYTtLGzjXUCo3hpHFQyM-Z2q1dh8xMHtt00CxLQN5Hr7ADnAJyE/w640-h422/getLaunches.PNG&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;It&#39;s cool to see the different responses without having to write any code specifically to handle them. A natural extension to this prototype would be to add &lt;a href=&quot;https://www.apollographql.com/docs/tutorial/mutation-resolvers/&quot; target=&quot;_blank&quot;&gt;mutation resolvers&lt;/a&gt;&amp;nbsp;so clients could also update the graph.&lt;/p&gt;&lt;p&gt;Finally, the &lt;a href=&quot;https://www.thoughtworks.com/radar#graphql-grandiosity&quot; target=&quot;_blank&quot;&gt;ThoughtWorks Tech Radar&lt;/a&gt; cautions against trying to create a universal, canonical, centralized data model. I think the &lt;a href=&quot;https://martinfowler.com/bliki/BoundedContext.html&quot; target=&quot;_blank&quot;&gt;bounded context&lt;/a&gt; ideas from DDD would apply. In their &lt;a href=&quot;https://www.thoughtworks.com/radar/techniques?blipid=202005092&quot; target=&quot;_blank&quot;&gt;zero trust architecture blurb&lt;/a&gt;&amp;nbsp;they also mention that a network perimeter isn&#39;t a security boundary anymore. That makes me question thinking of the API gateway as a place to shift all those cross-cutting concerns to. Do users have to go through the gateway or can they hit microservices directly? 
They mention &lt;a href=&quot;https://istio.io/latest/docs/concepts/what-is-istio/&quot; target=&quot;_blank&quot;&gt;service mesh&lt;/a&gt; as a solution and that would seem to be an API gateway alternative, but &lt;a href=&quot;https://blog.christianposta.com/microservices/do-i-need-an-api-gateway-if-i-have-a-service-mesh/&quot; target=&quot;_blank&quot;&gt;they aren&#39;t mutually exclusive&lt;/a&gt;&amp;nbsp;either.&lt;/p&gt;</description><link>http://www.programmingopiethehokie.com/2020/11/graphql-api-gateway-prototype.html</link><author>noreply@blogger.com (opiethehokie)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivh2E6BvNZr_ZuFj2V2wanBsxpkZrmf6-qKZ1rrBvDZG0jdOFMW2W6ZeIFOumJlHtmCZ0Jv7AR8EsJDJyd_pIvGx7MGv9Wxa-ZijVg8efMSKzvlQCYaWrv105AfZFfyEtBWJAEsf1qD9xF/s72-w640-h216-c/getLaunchById.PNG" height="72" width="72"/></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-6870051027254321630</guid><pubDate>Sat, 13 Jun 2020 21:06:00 +0000</pubDate><atom:updated>2020-06-13T17:06:41.620-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">big data</category><category domain="http://www.blogger.com/atom/ns#">nosql</category><title>Column-oriented Database Basics</title><description>&lt;div&gt;This is a short post on &lt;a href=&quot;https://en.wikipedia.org/wiki/Column-oriented_DBMS&quot; target=&quot;_blank&quot;&gt;column-oriented databases&lt;/a&gt;. I&#39;ll barely scratch the surface, but among the types of NoSQL databases—document, key-value, column-oriented and graph—I&#39;ve always thought column-oriented was the most difficult to wrap my head around. Hopefully we can get past that initial hurdle here, run a few queries, and they will seem like less of a mystery.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A relational database is optimized for retrieving rows of data. 
This works well for &lt;a href=&quot;https://en.wikipedia.org/wiki/Online_transaction_processing&quot; target=&quot;_blank&quot;&gt;transactional applications&lt;/a&gt;. A column-oriented database is optimized for retrieving columns of data. This works well for &lt;a href=&quot;https://en.wikipedia.org/wiki/Online_analytical_processing&quot; target=&quot;_blank&quot;&gt;analytical applications&lt;/a&gt; and some queries, like aggregations, become really fast because much less data needs to be read from disk to retrieve the whole column. There seems to be a lot of overlap between column-oriented databases and data warehouses.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A relational database would store 3 rows of data like this:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;1:a,b,c;2:d,e,f;3:g,h,i&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;While a column-oriented database would store the same data like this:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;a:1,d:2,g:3;b:1,e:2,h:3;c:1,f:2,i:3&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For a low-cardinality column, compression algorithms work very well. Something like &lt;i&gt;a:2,a:3&lt;/i&gt; becomes &lt;i&gt;a:2,3&lt;/i&gt;. In some ways this is like &lt;a href=&quot;https://en.wikipedia.org/wiki/Database_normalization&quot; target=&quot;_blank&quot;&gt;normalizing a relational database&lt;/a&gt;&amp;nbsp;to reduce data duplication, but I don&#39;t think that enables the same level of compression or gives you the same data locality benefits.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Of course column-oriented databases aren&#39;t good for all workloads. They aren&#39;t optimized for queries that touch many fields. Writes can also be slow since they aren&#39;t just appending to the end of a file.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So where do they fit into a big data architecture? 
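As a toy illustration of the row versus column layouts above, here is why a single-column aggregate favors columnar storage; the field names are made up for the example.

```python
# Row layout: whole records kept together, like a relational database.
rows = [
    {"id": 1, "event": "concert", "qtysold": 2},
    {"id": 2, "event": "show", "qtysold": 4},
    {"id": 3, "event": "concert", "qtysold": 1},
]

# Column layout: each field's values kept together and contiguous.
columns = {
    "id": [1, 2, 3],
    "event": ["concert", "show", "concert"],
    "qtysold": [2, 4, 1],
}

# Row store: every full record must be scanned to sum one field.
total_rows = sum(r["qtysold"] for r in rows)

# Column store: only the one column's values are read.
total_cols = sum(columns["qtysold"])
```

In memory the two totals are of course the same; the difference on disk is that the column store reads a fraction of the bytes for this query.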
When thinking about this I remembered the article &lt;a href=&quot;https://martinfowler.com/articles/data-monolith-to-mesh.html&quot; target=&quot;_blank&quot;&gt;How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh&lt;/a&gt;. It avoids mentioning specific products, but the way I interpret it is that the first generation was dedicated data warehouse products where you did &lt;a href=&quot;https://en.wikipedia.org/wiki/Extract,_transform,_load&quot; target=&quot;_blank&quot;&gt;ETL&lt;/a&gt;. The second generation was more &lt;a href=&quot;https://en.wikipedia.org/wiki/Extract,_load,_transform&quot; target=&quot;_blank&quot;&gt;ELT&lt;/a&gt;, maybe using Hadoop and Spark. The third generation, the eponymous data mesh, unifies batch and stream processing, perhaps adding Kafka to the above mix. Using the &lt;a href=&quot;https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/&quot; target=&quot;_blank&quot;&gt;CQRS pattern&lt;/a&gt;, for example, a column-oriented database could fulfill some of the Q (query) operations as a sort of data warehouse.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For my first column-oriented database adventure, after that background research, I chose Amazon Redshift and I followed their &lt;a href=&quot;https://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html&quot; target=&quot;_blank&quot;&gt;Getting Started with Amazon Redshift&lt;/a&gt; guide. It only took about an hour to spin up a Redshift cluster, upload their sample data to an S3 bucket, copy that into Redshift, then run a few SQL queries.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The big surprise and takeaway was that this felt just like using a relational database. The data I uploaded was in text files (basically CSV files) and was used to create tables. The queries I ran were SQL queries that joined tables, selected different fields and aggregated the results. 
The details of how the database works under the covers are interesting, but they don&#39;t inform its usage. There isn&#39;t much mystery after all.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;color: #16191f; font-family: Monaco, Menlo, Consolas, &amp;quot;Courier Prime&amp;quot;, Courier, &amp;quot;Courier New&amp;quot;, monospace; font-size: 14.88px; white-space: pre;&quot;&gt;SELECT * FROM event;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh2aXHBERgpUZrBpl-etZpmAI2tlbritIdfKJA9jpdBNcwLe-gTB0zwIpNNjtyDpfWPFZhbVW5QoElbU11-V13-XLXWAxESesXtjpNZ2gSGFci9Z9wMy1je5mBECl_AfbOoiW1RyCqFfuN/s1289/selectstar.PNG&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;456&quot; data-original-width=&quot;1289&quot; height=&quot;226&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh2aXHBERgpUZrBpl-etZpmAI2tlbritIdfKJA9jpdBNcwLe-gTB0zwIpNNjtyDpfWPFZhbVW5QoElbU11-V13-XLXWAxESesXtjpNZ2gSGFci9Z9wMy1je5mBECl_AfbOoiW1RyCqFfuN/w640-h226/selectstar.PNG&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span style=&quot;color: #16191f; font-family: Monaco, Menlo, Consolas, &amp;quot;Courier Prime&amp;quot;, Courier, &amp;quot;Courier New&amp;quot;, monospace; font-size: 14.88px; white-space: pre;&quot;&gt;SELECT firstname, lastname, total_quantity 
FROM   (SELECT buyerid, sum(qtysold) total_quantity
        FROM  sales
        GROUP BY buyerid
        ORDER BY total_quantity desc limit 10) Q, users
WHERE Q.buyerid = userid
ORDER BY Q.total_quantity desc;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjawYGeUXNQQCjCNa8BSCp_SGfPoOdFyZiGwh8c1-bW1M-l8MM4p0T7k10AGxUtmFoRcg9QMOA3dtNlcAsMR0thzhD5zjWZ0kOqY2Yki55VP1vFcqu0pir50cF07oV3GYeN78ZlALFfQgdz/s979/topevents.PNG&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;411&quot; data-original-width=&quot;979&quot; height=&quot;268&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjawYGeUXNQQCjCNa8BSCp_SGfPoOdFyZiGwh8c1-bW1M-l8MM4p0T7k10AGxUtmFoRcg9QMOA3dtNlcAsMR0thzhD5zjWZ0kOqY2Yki55VP1vFcqu0pir50cF07oV3GYeN78ZlALFfQgdz/w640-h268/topevents.PNG&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A couple of other column-oriented databases you may have heard of are Cassandra and HBase. NoSQL databases are built to scale horizontally, so you also need to consider how the &lt;a href=&quot;https://en.wikipedia.org/wiki/CAP_theorem&quot; target=&quot;_blank&quot;&gt;CAP Theorem&lt;/a&gt; applies to your situation when choosing among them, as they each make different trade-offs in addition to their unique features. 
The best choice will be highly data and workload specific.&lt;/div&gt;</description><link>http://www.programmingopiethehokie.com/2020/06/column-oriented-database-basics.html</link><author>noreply@blogger.com (opiethehokie)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh2aXHBERgpUZrBpl-etZpmAI2tlbritIdfKJA9jpdBNcwLe-gTB0zwIpNNjtyDpfWPFZhbVW5QoElbU11-V13-XLXWAxESesXtjpNZ2gSGFci9Z9wMy1je5mBECl_AfbOoiW1RyCqFfuN/s72-w640-h226-c/selectstar.PNG" height="72" width="72"/></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-5061767332458460362</guid><pubDate>Sat, 25 Apr 2020 15:34:00 +0000</pubDate><atom:updated>2020-04-25T11:34:02.046-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">algorithms</category><category domain="http://www.blogger.com/atom/ns#">python</category><title>Dynamic Programming</title><description>Optimal substructure means that the optimal solution can be constructed from optimal solutions of sub-problems. Think recursion. When we combine optimal solutions to non-overlapping sub-problems the algorithmic technique is called divide and conquer. Think of a merge sort where each sub-sort can happen independently and they are combined at the end. When we combine optimal solutions to overlapping sub-problems the algorithmic technique we can use is &lt;a href=&quot;https://en.wikipedia.org/wiki/Dynamic_programming&quot; target=&quot;_blank&quot;&gt;dynamic programming&lt;/a&gt;. Because the sub-problems overlap, we can improve the running time by remembering the results of already computed sub-problems and not computing them again.&lt;br /&gt;
&lt;br /&gt;
An easy way to visualize this is in calculating Fibonacci numbers. The standard recursive solution &lt;a href=&quot;https://www.programmingopiethehokie.com/2014/04/exploring-memoization.html&quot; target=&quot;_blank&quot;&gt;can be memoized&lt;/a&gt; to significantly improve time complexity at the cost of using more space to store sub-problem results. This is top-down dynamic programming—recursion and memoization. Top-down dynamic programming can be easier to program and anecdotally it is more commonly used.&lt;br /&gt;
&lt;br /&gt;
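A sketch of the top-down approach, using the standard library's memoization decorator (not necessarily the exact code in the linked post):

```python
from functools import lru_cache

# Top-down dynamic programming: plain recursion plus memoization.
# lru_cache remembers sub-problem results so each fib(k) is computed
# only once, turning exponential time into linear time.
@lru_cache(maxsize=None)
def fib(n):
    return n if n in (0, 1) else fib(n - 1) + fib(n - 2)
```

The recursion tree is unchanged; the cache simply prunes every branch that has already been solved.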
Bottom-up dynamic programming is instead iterative and we solve sub-problems first, building them into bigger sub-problems. Without the overhead of recursive calls we don&#39;t necessarily increase space complexity, but still have the decrease in time complexity. For calculating the nth Fibonacci number this looks like:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/c751a2335d76125aa35d153243c0b5f7.js&quot;&gt;&lt;/script&gt;

This is time complexity O(n) and space complexity O(1), a huge improvement on the naive recursive solution that&#39;s O(2&lt;sup&gt;n&lt;/sup&gt;) time and O(n) space. For details on big-O of recursive algorithms read &lt;a href=&quot;https://www.programmingopiethehokie.com/2016/05/big-o-notation-for-recursive-algorithms.html&quot; target=&quot;_blank&quot;&gt;this&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
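The bottom-up version boils down to something like the following sketch (not necessarily the gist's exact code): only the two most recent sub-problem results are kept.

```python
# Bottom-up Fibonacci: build up from the base cases, keeping only the
# last two values, for O(n) time and O(1) space.
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```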
Some other interesting applications of dynamic programming are:&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm&quot; target=&quot;_blank&quot;&gt;Dijkstra&#39;s algorithm&lt;/a&gt;&amp;nbsp;(shortest paths)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Tower_of_Hanoi&quot; target=&quot;_blank&quot;&gt;Tower of Hanoi&lt;/a&gt;&amp;nbsp;(teaching game)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Knapsack_problem&quot; target=&quot;_blank&quot;&gt;Knapsack problem&lt;/a&gt;&amp;nbsp;(combinatorial optimization)&lt;/li&gt;
&lt;/ul&gt;
&lt;br /&gt;
The knapsack problem, for example, has a top-down solution but I think the bottom-up solution is especially appealing. Using a matrix to store sub-problem solutions we can make the O(2&lt;sup&gt;n&lt;/sup&gt;) time recursive algorithm O(nW) time and space, where W is the knapsack capacity:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/9dfaf7db1ef7691acf71a9c1504f0e52.js&quot;&gt;&lt;/script&gt;

But wait, there&#39;s more. We only need to remember a part of the matrix for the next iteration, which means we don&#39;t even need the matrix. It can be further optimized to only use O(W) space:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/2b132e9b6607097a12c5b3cfb7fd6ab0.js&quot;&gt;&lt;/script&gt;

Not, in my opinion, an obvious solution by any means, but a very elegant one.
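That O(W)-space idea can be sketched like this; the weights and values are made up for the example and may not match the gist:

```python
# Space-optimized 0/1 knapsack: a single O(W) array updated per item.
def knapsack(capacity, weights, values):
    best = [0] * (capacity + 1)  # best[c] = max value within capacity c
    for w, v in zip(weights, values):
        # iterate capacities downward so each item is used at most once
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

result = knapsack(10, [5, 4, 6, 3], [10, 40, 30, 50])
```

The downward iteration is the subtle part: it guarantees that `best[c - w]` still refers to the previous item's row, which is exactly the slice of the matrix we needed to remember.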
</description><link>http://www.programmingopiethehokie.com/2020/04/dynamic-programming.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-7271250970468002514</guid><pubDate>Sat, 11 Apr 2020 19:32:00 +0000</pubDate><atom:updated>2021-12-30T08:41:57.800-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">algorithms</category><category domain="http://www.blogger.com/atom/ns#">python</category><category domain="http://www.blogger.com/atom/ns#">quantum</category><title>Quantum Teleportation</title><description>I recently checked out the &lt;a href=&quot;https://quantum-computing.ibm.com/&quot; target=&quot;_blank&quot;&gt;IBM Quantum Experience&lt;/a&gt; and was really impressed with the content available for getting started with quantum computing. There are simulators for executing quantum circuits, runnable both locally and in the cloud, but the highlight is being able to submit jobs to run on a real quantum computer. To create these circuits we use the Python DSL &lt;a href=&quot;https://qiskit.org/documentation/index.html&quot; target=&quot;_blank&quot;&gt;Qiskit&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
For gaining a better understanding of Qiskit I recommend starting with the &lt;a href=&quot;https://www.youtube.com/playlist?list=PLOFEBzvs-Vvp2xg9-POLJhQwtVktlYGbY&quot; target=&quot;_blank&quot;&gt;Coding with Qiskit&lt;/a&gt;&amp;nbsp;video series. Following along with the videos and coding my first circuit then led down a rabbit hole of trying to understand what I was really building and how it could be useful.&lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;https://en.wikipedia.org/wiki/Quantum_computing&quot; target=&quot;_blank&quot;&gt;Quantum computing&lt;/a&gt; uses phenomena from quantum mechanics like superposition and entanglement to perform computation. In classical computing we have bits which can either be on or off, 1 or 0. In quantum computing we have &lt;a href=&quot;https://en.wikipedia.org/wiki/Qubit&quot; target=&quot;_blank&quot;&gt;qubits&lt;/a&gt;&amp;nbsp;which can be on or off at the same time, or somewhere in between. This is superposition. I like the coin analogy from &lt;a href=&quot;https://www.wired.co.uk/article/quantum-computing-explained&quot; target=&quot;_blank&quot;&gt;Quantum computers will change the world (if they work)&lt;/a&gt;&amp;nbsp;of this being a coin flip (classical) versus a spinning coin (quantum).&lt;br /&gt;
&lt;br /&gt;
A qubit is represented as a vector of two &lt;a href=&quot;https://en.wikipedia.org/wiki/Complex_number&quot; target=&quot;_blank&quot;&gt;complex numbers&lt;/a&gt; (amplitudes) with length 1, where the squared magnitudes of the amplitudes give the probabilities of measuring a 0 or a 1. Measurement destroys quantum state, collapsing the qubit to a classical bit, and it cannot be reversed. A quantum state also cannot be copied the way a classical bit&#39;s state can. Instead, quantum teleportation is the transfer of state from one qubit to another via a linkage known as entanglement. Even when moved far apart, entangled particles produce correlated measurement outcomes. One coin flip mirroring another, linked flip.&lt;br /&gt;
&lt;br /&gt;
The probabilistic nature of quantum computing lets a computation act on many states at once, making some intractable problems tractable. &lt;a href=&quot;https://en.wikipedia.org/wiki/Quantum_algorithm&quot; target=&quot;_blank&quot;&gt;Quantum algorithms&lt;/a&gt; use superposition and entanglement to solve problems like factorization or search faster than classical algorithms. Qiskit has some amazing Jupyter notebooks showing how these can be used. Three that looked the most interesting to me:&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/Qiskit/qiskit-iqx-tutorials/blob/master/qiskit/advanced/aqua/machine_learning/qsvm_classification.ipynb&quot; target=&quot;_blank&quot;&gt;quantum-enhanced support vector machines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/Qiskit/qiskit-iqx-tutorials/blob/master/qiskit/advanced/aqua/optimization/max_cut_and_tsp.ipynb&quot; target=&quot;_blank&quot;&gt;traveling salesman problem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/Qiskit/qiskit-iqx-tutorials/blob/master/qiskit/advanced/aqua/finance/optimization/portfolio_diversification.ipynb&quot; target=&quot;_blank&quot;&gt;portfolio diversification&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
My first circuit is much simpler, of course, but I feel good about getting it running. Based on the videos linked above, I teleport a 1 state from one qubit to another and then measure it to confirm. I simulate this locally, in the cloud, and then run it on one of IBM&#39;s quantum computers. The run on the actual quantum computer includes some noise, which comes from the imperfect nature of today&#39;s hardware.&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/cff5f5c086d259b9f10ea2592e1daaa8.js&quot;&gt;&lt;/script&gt;
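The gist requires Qiskit and an IBM account to run, so here is a rough standalone sketch of the same teleportation circuit simulated with plain numpy (the qubit ordering, the gate construction, and the use of deferred-measurement corrections in place of real measurements are my own assumptions, not the gist's code):

```python
import numpy as np

# single-qubit gates and projectors
I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]])
Z = np.diag([1, -1])
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
P0 = np.diag([1, 0])  # projects onto 0
P1 = np.diag([0, 1])  # projects onto 1

def op(mats):
    # Kronecker product of one 2x2 matrix per qubit, qubit 0 leftmost
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

def single(u, k, n=3):
    mats = [I2] * n
    mats[k] = u
    return op(mats)

def controlled(u, ctrl, targ, n=3):
    # apply u on targ only when ctrl is 1
    a = [I2] * n
    a[ctrl] = P0
    b = [I2] * n
    b[ctrl] = P1
    b[targ] = u
    return op(a) + op(b)

state = np.zeros(8)
state[0] = 1.0                       # start in 000
state = single(X, 0) @ state         # put the message qubit q0 in state 1
state = single(H, 1) @ state         # entangle q1 and q2 into a Bell pair
state = controlled(X, 1, 2) @ state
state = controlled(X, 0, 1) @ state  # Bell-basis change on q0 and q1
state = single(H, 0) @ state
state = controlled(X, 1, 2) @ state  # deferred-measurement corrections
state = controlled(Z, 0, 2) @ state
# probability that the receiving qubit q2 measures 1
p_one = sum(abs(state[i]) ** 2 for i in range(8) if i % 2 == 1)
print(p_one)  # the teleported state measures 1 with certainty
```

With a perfect (noiseless) simulation the receiving qubit always measures 1, which is what the real hardware run only approximates.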

&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2nL7P4u2XlVM8uHtc4JFTHGOu7nCIMbwkge6vl2HeSb7biBNFfCb8WmZaOF3L0ygYQH-cLkcfPs-NhCyN_aZH1DPK-mTFtzOBrno7Ln0w4zAbC190n5WkO_s9pqsYEtp14a3P9C1WfR5t/s1600/quantum.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;298&quot; data-original-width=&quot;371&quot; height=&quot;257&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2nL7P4u2XlVM8uHtc4JFTHGOu7nCIMbwkge6vl2HeSb7biBNFfCb8WmZaOF3L0ygYQH-cLkcfPs-NhCyN_aZH1DPK-mTFtzOBrno7Ln0w4zAbC190n5WkO_s9pqsYEtp14a3P9C1WfR5t/s320/quantum.png&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;UPDATE:&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;&lt;a href=&quot;https://www.tensorflow.org/quantum&quot; target=&quot;_blank&quot;&gt;TensorFlow Quantum&lt;/a&gt; is another library that could be used if you want to leverage Google&#39;s quantum computing resources.&lt;/div&gt;
&lt;br /&gt;</description><link>http://www.programmingopiethehokie.com/2020/04/quantum-teleportation.html</link><author>noreply@blogger.com (opiethehokie)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2nL7P4u2XlVM8uHtc4JFTHGOu7nCIMbwkge6vl2HeSb7biBNFfCb8WmZaOF3L0ygYQH-cLkcfPs-NhCyN_aZH1DPK-mTFtzOBrno7Ln0w4zAbC190n5WkO_s9pqsYEtp14a3P9C1WfR5t/s72-c/quantum.png" height="72" width="72"/></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-3291283279022722242</guid><pubDate>Sat, 28 Mar 2020 21:59:00 +0000</pubDate><atom:updated>2023-07-23T15:26:00.442-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">python</category><category domain="http://www.blogger.com/atom/ns#">time series</category><title>Geometric Brownian Motion</title><description>&lt;a href=&quot;https://en.wikipedia.org/wiki/Geometric_Brownian_motion&quot; target=&quot;_blank&quot;&gt;Geometric Brownian Motion&lt;/a&gt;&amp;nbsp;is a stochastic process that can be used to model stock prices. It&#39;s related to random walks and Markov chains. This is a much different way to look at time series than what I explored in my&amp;nbsp;&lt;a href=&quot;https://www.programmingopiethehokie.com/2020/03/time-series-predictions.html&quot; target=&quot;_blank&quot;&gt;Time Series Predictions&lt;/a&gt;&amp;nbsp;post and given the recent market volatility it seems especially timely to take a closer look at it.&lt;br /&gt;
&lt;br /&gt;
There are detailed explanations of the math and theory elsewhere. I&#39;ve just written some code to see it in action. This generates 100 simulations for the S&amp;amp;P 500 ETF with the ticker SPY, modeling the past year:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/0ac4305cbd1eac9e36dfd5d3b07c8285.js&quot;&gt;&lt;/script&gt;
What&#39;s especially interesting today is that the current value of SPY is on the extreme lower end of what we see simulated. Even after increasing the number of simulations from 100 to 1,000 or more, the current value still lies on a rare path.&lt;br /&gt;
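For reference, the GBM update rule the simulation relies on can be sketched in a few lines of numpy (the drift, volatility, and starting price below are made-up illustrative values, not fitted SPY parameters):

```python
import numpy as np

rng = np.random.default_rng(42)
s0, mu, sigma = 280.0, 0.05, 0.2   # start price, annual drift, annual volatility
days, sims = 252, 100              # one trading year, 100 paths
dt = 1 / days

# GBM step: S(t+dt) = S(t) * exp((mu - sigma^2 / 2) dt + sigma sqrt(dt) Z)
# swap in rng.laplace(0, 1 / np.sqrt(2), size=(sims, days)) for fatter tails
z = rng.standard_normal((sims, days))
steps = np.exp((mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z)
paths = s0 * np.cumprod(steps, axis=1)  # each row is one simulated year

print(paths[:, -1].mean())  # average terminal price across paths
```

The Laplace variant mentioned later is the same code with the commented-out draw substituted for the normal one (the scale 1/√2 keeps unit variance so the two runs are comparable).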
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6peDysXudTCVo1XdkOMJVVq9sdzAGXqhOovB4rUMx9Cv8iFb0GSJnruFhE-r0P-X21NE0WFm1EgcXJuTit1PxPuoTUHBghzgfrLME_TFTk9SDTQnPFwg2IaBCq-SjxvcrBJq0XqUI_ze7/s1600/gbm.PNG&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;416&quot; data-original-width=&quot;565&quot; height=&quot;292&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6peDysXudTCVo1XdkOMJVVq9sdzAGXqhOovB4rUMx9Cv8iFb0GSJnruFhE-r0P-X21NE0WFm1EgcXJuTit1PxPuoTUHBghzgfrLME_TFTk9SDTQnPFwg2IaBCq-SjxvcrBJq0XqUI_ze7/s400/gbm.PNG&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
There are some known problems with the model that may help explain this. First, volatility isn&#39;t really constant in financial markets. Second, the randomness in GBM is normally distributed, but we know that stock returns are not: they have fatter tails, or higher kurtosis. Stock prices also react to specific geopolitical events that are not random at all, sometimes even opening at a different level than the previous day&#39;s close.&lt;br /&gt;
&lt;br /&gt;
Out of curiosity, and because the second issue is the easiest to tweak, I replaced the normal distribution with a &lt;a href=&quot;https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.laplace.html&quot; target=&quot;_blank&quot;&gt;Laplace distribution&lt;/a&gt; and saw a slightly wider dispersion of results (in both directions). In reality, though, the 52-week range is 218.26 - 339.08, so we still aren&#39;t capturing the extremes witnessed.&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPTMzPjNS8APalMbUzur1r5BLrBnhymQHXYk4fS_K6kR28dBGqLFIC_isZ5NWBBeW-jVfTEZniqMcZ-233sxan-fZVchM1Si6PkyCxewT2FSQ1HzOOVUfIxf7v3WS2c6LyBu1iu0lr-_8z/s1600/lgbm.PNG&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;424&quot; data-original-width=&quot;559&quot; height=&quot;302&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPTMzPjNS8APalMbUzur1r5BLrBnhymQHXYk4fS_K6kR28dBGqLFIC_isZ5NWBBeW-jVfTEZniqMcZ-233sxan-fZVchM1Si6PkyCxewT2FSQ1HzOOVUfIxf7v3WS2c6LyBu1iu0lr-_8z/s400/lgbm.PNG&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Any ideas on what&#39;s going on here? It must have something to do with the massive increase in volatility at the end of what was otherwise a calm year. Please comment.</description><link>http://www.programmingopiethehokie.com/2020/03/geometric-brownian-motion.html</link><author>noreply@blogger.com (opiethehokie)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6peDysXudTCVo1XdkOMJVVq9sdzAGXqhOovB4rUMx9Cv8iFb0GSJnruFhE-r0P-X21NE0WFm1EgcXJuTit1PxPuoTUHBghzgfrLME_TFTk9SDTQnPFwg2IaBCq-SjxvcrBJq0XqUI_ze7/s72-c/gbm.PNG" height="72" width="72"/></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-4870900363894110932</guid><pubDate>Sat, 14 Mar 2020 21:38:00 +0000</pubDate><atom:updated>2021-09-16T17:09:03.621-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">feature engineering</category><category domain="http://www.blogger.com/atom/ns#">machine learning</category><category domain="http://www.blogger.com/atom/ns#">python</category><category domain="http://www.blogger.com/atom/ns#">time series</category><title>Time Series Predictions</title><description>Time series data is just a series of observations ordered in time. As simple as it sounds, there are important differences when analyzing time series data vs. cross-sectional data. This post will attempt to cover enough basics, from statistics and machine learning, to get to a point where we can forecast future observations.&lt;br /&gt;
&lt;br /&gt;
First, some terminology. Data is &lt;a href=&quot;https://en.wikipedia.org/wiki/Autocorrelation&quot; target=&quot;_blank&quot;&gt;autocorrelated&lt;/a&gt; when there are similarities between an observation and previous observations. &lt;a href=&quot;https://en.wikipedia.org/wiki/Seasonality&quot; target=&quot;_blank&quot;&gt;Seasonality&lt;/a&gt; is when the similarities occur at regular intervals. Trend is a long-term upward or downward movement. And data is &lt;a href=&quot;https://en.wikipedia.org/wiki/Stationary_process&quot; target=&quot;_blank&quot;&gt;stationary&lt;/a&gt; when its statistical properties, like mean and variance, do not change over time.&lt;br /&gt;
&lt;br /&gt;
I&#39;ll generate data with these characteristics to use for the rest of the post:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/c2dcfd0ae7f1f55e279564f74c7cd692.js&quot;&gt;&lt;/script&gt;

&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEic_N72kuxGsbpLs5gju_wHdtA2NGeWQEG201sIUhjus3oGNwxF2zYt3Lmsh1itTK-8V4LgEVCgAxcCTXT1U0F-6OKRq5IBh0tMOGfn5S8dZ6vDzZ1eTjWIobD0eT4KahoOeZq_V9B5UOkM/s1600/fig1.PNG&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;474&quot; data-original-width=&quot;627&quot; height=&quot;301&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEic_N72kuxGsbpLs5gju_wHdtA2NGeWQEG201sIUhjus3oGNwxF2zYt3Lmsh1itTK-8V4LgEVCgAxcCTXT1U0F-6OKRq5IBh0tMOGfn5S8dZ6vDzZ1eTjWIobD0eT4KahoOeZq_V9B5UOkM/s400/fig1.PNG&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;h3&gt;
Detecting stationarity&lt;/h3&gt;
While time series data is usually not stationary, stationarity is important because most statistical models and tests have that assumption. The &lt;a href=&quot;https://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test&quot; target=&quot;_blank&quot;&gt;Augmented Dickey-Fuller (ADF) test&lt;/a&gt;&amp;nbsp;can be used on normally distributed data to detect stationarity. The null hypothesis is that the data is not stationary, thus you are looking to reject it with a certain level of confidence.&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/5c1847e78d24e3190243b60d738ea52b.js&quot;&gt;&lt;/script&gt;
There are other (non-parametric) stationarity tests without the normally distributed data assumption that are beyond the scope of this post.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Transformations&lt;/h3&gt;
By applying different transformations to our data we can make non-stationary data stationary. One approach is to subtract the rolling mean, or a weighted rolling mean favoring more recent observations, from the data. Another approach is called differencing: subtract from each observation the value from some fixed period earlier, like a week or a month.&lt;br /&gt;
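Both transformations can be sketched with plain numpy on a synthetic series (the trend, seasonal period, and window length below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(120)
# hypothetical series: linear trend + seasonality (period 12) + noise
x = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(size=120)

# approach 1: subtract a rolling mean (window of one full season)
window = 12
rolling_mean = np.convolve(x, np.ones(window) / window, mode="valid")
detrended = x[window - 1:] - rolling_mean

# approach 2: seasonal differencing, subtract the value from one period ago
differenced = x[window:] - x[:-window]

print(x.var(), detrended.var(), differenced.var())
```

Both transformed series have far less variance than the original, since the trend (and, for the differenced series, the seasonality) has been removed.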
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/790e27a3210efcc0394ba08857e83aee.js&quot;&gt;&lt;/script&gt;

&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH6xPCJMTRV7rCNafgWBZ-mMMl5mjiLD_O2FsUtN9Ph3hyphenhyphenf683Vk58gfxfVGoxpvThz_y31uRoBw4bD-uSyjDNBR16OklkpEdLtfYVBDlQdGa6ojcMowQ1R3wRZjx8Oa-cdab4VJkZ_OH9/s1600/fig2.PNG&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;430&quot; data-original-width=&quot;564&quot; height=&quot;303&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH6xPCJMTRV7rCNafgWBZ-mMMl5mjiLD_O2FsUtN9Ph3hyphenhyphenf683Vk58gfxfVGoxpvThz_y31uRoBw4bD-uSyjDNBR16OklkpEdLtfYVBDlQdGa6ojcMowQ1R3wRZjx8Oa-cdab4VJkZ_OH9/s400/fig2.PNG&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h3&gt;
Forecasting&lt;/h3&gt;
&lt;br /&gt;
Special care must be taken when splitting time series data into a training and a test set. The order must be preserved; the data cannot be reshuffled. For cross-validation, it is also important to evaluate the model only on future observations, so a &lt;a href=&quot;https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-of-time-series-data&quot; target=&quot;_blank&quot;&gt;variation of k-fold&lt;/a&gt; is needed.&lt;br /&gt;
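An expanding-window split, similar in spirit to scikit-learn's TimeSeriesSplit, can be sketched in a few lines (my own illustration, not the library's implementation):

```python
# each fold trains on everything before the split point and tests only
# on the observations immediately after it, so order is preserved
def time_series_splits(n, folds):
    fold = n // (folds + 1)
    for k in range(1, folds + 1):
        train = list(range(0, k * fold))             # past observations
        test = list(range(k * fold, (k + 1) * fold))  # future observations
        yield train, test

for train, test in time_series_splits(12, 3):
    print(len(train), len(test))  # training window grows, test window slides
```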
&lt;br /&gt;
&lt;h4&gt;
SARIMA&lt;/h4&gt;
&lt;a href=&quot;https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average#Variations_and_extensions&quot; target=&quot;_blank&quot;&gt;Seasonal autoregressive integrated moving average (SARIMA)&lt;/a&gt; is a model that can be fitted via a &lt;a href=&quot;https://en.wikipedia.org/wiki/Kalman_filter&quot; target=&quot;_blank&quot;&gt;Kalman filter&lt;/a&gt; to time series data. It accounts for seasonality and trend by differencing the data; however, it is a linear model, so each observation needs to be a linear combination of past observations. A log or square root transform, for example, might help make the time series linear.&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/7ef68746308c57ceb4d9a256c37c3db3.js&quot;&gt;&lt;/script&gt;

&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaUswaTdWgiTsdAP7tTi_nyR2phHwgdD7eTxgcN7SqwhumlQKIb4Y_O2OmZ4la2rZrAg806RcZzqrWLNpoxMJjQgKBr_cbmvgo3ckl4MwB4NSA1FcB1eUGGNpaVssa8ptzL4R5l4Dium5x/s1600/fig3.PNG&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;430&quot; data-original-width=&quot;566&quot; height=&quot;303&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaUswaTdWgiTsdAP7tTi_nyR2phHwgdD7eTxgcN7SqwhumlQKIb4Y_O2OmZ4la2rZrAg806RcZzqrWLNpoxMJjQgKBr_cbmvgo3ckl4MwB4NSA1FcB1eUGGNpaVssa8ptzL4R5l4Dium5x/s400/fig3.PNG&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;h4&gt;
&lt;b&gt;RNN&lt;/b&gt;&lt;/h4&gt;
A recurrent neural network (RNN) with &lt;a href=&quot;https://en.wikipedia.org/wiki/Long_short-term_memory&quot; target=&quot;_blank&quot;&gt;long short-term memory (LSTM)&lt;/a&gt; is an alternative to SARIMA for modeling time series data. At the cost of complexity, it can handle non-linear data or data that isn&#39;t normally distributed.&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/fc2202f56dddac077a4ab4d4bf7594d1.js&quot;&gt;&lt;/script&gt;

&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgqrSlW-HCMn0E6d_e7sRUFPlqOcoR41geD7AROO_v5PEgwekxJ5pT9ZCqiX9S4xq9brVAg0ZOMZQ4jeCyckGRcFoC7bdPY79h5mYu1B-5itGL4jphlWY_UCsoyQ-nY91gO-dPBuaH2J-g/s1600/fig4.PNG&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;417&quot; data-original-width=&quot;564&quot; height=&quot;295&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgqrSlW-HCMn0E6d_e7sRUFPlqOcoR41geD7AROO_v5PEgwekxJ5pT9ZCqiX9S4xq9brVAg0ZOMZQ4jeCyckGRcFoC7bdPY79h5mYu1B-5itGL4jphlWY_UCsoyQ-nY91gO-dPBuaH2J-g/s400/fig4.PNG&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
I didn&#39;t put a lot of effort into tuning these models, or coming up with additional features, and they aren&#39;t perfect, but we can start to get a feel for how they work. The SARIMA model looks underfit. It did, however, nicely ignore the randomness in the data. The RNN model clearly overfits the data and more work would be needed to get a smoother curve.&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: left;&quot;&gt;
This was my first attempt at working with SARIMAX and RNNs so any feedback is appreciated.&lt;/div&gt;
&lt;br /&gt;</description><link>http://www.programmingopiethehokie.com/2020/03/time-series-predictions.html</link><author>noreply@blogger.com (opiethehokie)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEic_N72kuxGsbpLs5gju_wHdtA2NGeWQEG201sIUhjus3oGNwxF2zYt3Lmsh1itTK-8V4LgEVCgAxcCTXT1U0F-6OKRq5IBh0tMOGfn5S8dZ6vDzZ1eTjWIobD0eT4KahoOeZq_V9B5UOkM/s72-c/fig1.PNG" height="72" width="72"/></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-8696849874476996777</guid><pubDate>Sat, 06 Apr 2019 22:35:00 +0000</pubDate><atom:updated>2023-07-23T15:26:19.571-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">probability</category><category domain="http://www.blogger.com/atom/ns#">python</category><category domain="http://www.blogger.com/atom/ns#">simulation</category><title>Probability Problems Programmatically</title><description>&lt;b&gt;Birthday Problem&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
How big a group of people do you need so that there is a 50% chance that 2 of them share a birthday? How about 99.9%? Of course, 100% is 366 people by the &lt;a href=&quot;https://en.wikipedia.org/wiki/Pigeonhole_principle&quot;&gt;pigeonhole principle&lt;/a&gt;, but how small the group can be for lower probabilities is surprising.&lt;br /&gt;
&lt;br /&gt;
The key is that birthday comparisons are made between every pair of individuals and not just between one individual and everyone else. Then the solution becomes p(n) = 1 - &lt;sub&gt;365&lt;/sub&gt;P&lt;sub&gt;n&lt;/sub&gt;/365&lt;sup&gt;n&lt;/sup&gt;. At n=23, the probability that at least two people share a birthday is about 50.7%.&lt;br /&gt;
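The closed-form answer is quick to check directly (a short sketch of the formula above):

```python
# p(n) = 1 - 365Pn / 365**n: one minus the chance all n birthdays differ
def p_shared(n):
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (365 - i) / 365
    return 1 - p_distinct

print(round(p_shared(23), 4))  # about 0.5073
```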
&lt;br /&gt;
If we didn&#39;t know how to arrive at this solution, we could use a simple &lt;a href=&quot;https://en.wikipedia.org/wiki/Monte_Carlo_method&quot;&gt;Monte Carlo&lt;/a&gt; simulation to estimate the probability:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/8b18902e6748b4c0b964cb89c58968f2.js&quot;&gt;&lt;/script&gt;

&lt;b&gt;Monty Hall Problem&lt;/b&gt; &lt;br /&gt;
You are on a game show and are given the opportunity to select one of 3 doors. Behind one door, known to the host, is a major prize. Behind the other 2 doors are goats. Once you&#39;ve made a selection the host will open a door showing a goat. Now you have the opportunity to select a different door. Should you switch or keep your original guess?&lt;br /&gt;
&lt;br /&gt;
This problem is very counter-intuitive and seems to cause disagreement among otherwise intelligent people even after the correct solution is known. By switching doors, you dramatically increase your chance of winning the prize from 1/3 to 2/3.&lt;br /&gt;
&lt;br /&gt;
Again, instead of worrying too much about how to arrive at the correct solution we can simulate it:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/2ef777397a11ed00e9d443b237aca87f.js&quot;&gt;&lt;/script&gt;
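For reference, a standalone version of the switching simulation looks something like this (my own sketch, with an arbitrary seed and trial count, not the gist's code):

```python
import random

random.seed(7)
wins = 0
trials = 10_000
for _ in range(trials):
    doors = {0, 1, 2}
    prize = random.choice([0, 1, 2])
    pick = random.choice([0, 1, 2])
    # the host opens a goat door that is neither the pick nor the prize
    opened = random.choice(list(doors - {pick, prize}))
    # always switch to the one remaining unopened door
    switched = (doors - {pick, opened}).pop()
    wins += (switched == prize)

print(wins / trials)  # close to 2/3
```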

&lt;b&gt; Determining if a coin is fair&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
Out of 1000 coin flips, how many heads can you get and still assume with 95% confidence that it&#39;s a fair coin?&lt;br /&gt;
&lt;br /&gt;
Coin flips follow a binomial distribution, but it can be approximated as a normal distribution based on the &lt;a href=&quot;https://en.wikipedia.org/wiki/Central_limit_theorem&quot;&gt;central limit theorem&lt;/a&gt;. A 95% confidence interval is ± 1.96 standard deviations. The standard deviation of the number of heads in 1000 flips is √(1000 · 0.5 · 0.5) ≈ 15.8, so 530 heads is (530-500)/15.8, or about 1.9 standard deviations.&lt;br /&gt;
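The interval itself takes only a few lines to compute (a sketch of the arithmetic above):

```python
import math

flips = 1000
# standard deviation of the head count for a fair coin: sqrt(n * p * (1 - p))
sd = math.sqrt(flips * 0.5 * 0.5)  # about 15.8 heads
margin = 1.96 * sd                 # 95% two-sided interval
low, high = flips / 2 - margin, flips / 2 + margin
print(round(low), round(high))     # roughly 469 to 531 heads
```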
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/39449c73ef1934b7676174186018ae33.js&quot;&gt;&lt;/script&gt;

&lt;b&gt; Rain in Seattle Problem&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
You call 3 friends and ask them if it&#39;s raining. Each friend has a 1/3 chance of lying and 2/3 chance of telling the truth. All 3 friends say it&#39;s raining. What is the probability it&#39;s actually raining?&lt;br /&gt;
If we say it&#39;s raining 10% of the time, then by &lt;a href=&quot;https://en.wikipedia.org/wiki/Bayes%27_theorem&quot;&gt;Bayes&#39; theorem&lt;/a&gt; our probability is .1 * (2/3)&lt;sup&gt;3&lt;/sup&gt; / (.1 * (2/3)&lt;sup&gt;3&lt;/sup&gt; + .9 * (1/3)&lt;sup&gt;3&lt;/sup&gt;). The answer is about 47%.&lt;br /&gt;
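The calculation is easy to check directly (a sketch of the formula above, using the assumed 10% prior):

```python
prior = 0.1           # assumed base rate of rain
truth = (2 / 3) ** 3  # all three friends tell the truth
lie = (1 / 3) ** 3    # all three friends lie

# Bayes' theorem: P(rain | all say rain)
p_rain = prior * truth / (prior * truth + (1 - prior) * lie)
print(round(p_rain, 2))  # 0.47
```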
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/d68a104f19aae09b6b5b9eef1bfaa019.js&quot;&gt;&lt;/script&gt;

These simple Monte Carlo simulations are a way to get an approximate answer when we don&#39;t know how to (or don&#39;t want to) calculate the exact answer. These were all just a few lines of code, took a few minutes to write, and run nearly instantly. It&#39;s a great way to verify the above answers as well. And to get a more precise answer, you simply trade the quick run time for more than 10,000 simulations.
</description><link>http://www.programmingopiethehokie.com/2019/04/probability-problems-programmatically.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-5460194318111350708</guid><pubDate>Wed, 11 Jul 2018 00:53:00 +0000</pubDate><atom:updated>2022-12-29T19:07:45.936-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">algorithms</category><category domain="http://www.blogger.com/atom/ns#">optimization</category><category domain="http://www.blogger.com/atom/ns#">python</category><title>Linear Programming</title><description>&lt;a href=&quot;https://en.wikipedia.org/wiki/Linear_programming&quot; target=&quot;_blank&quot;&gt;Linear programming&lt;/a&gt;&amp;nbsp;(LP) is a technique for the optimization of a linear objective function. Integer programming (IP) has the additional constraint of the unknown variables being integers. I vaguely remember these from school, but they are terms I see every so often so I wanted to code something up and get reacquainted.&lt;br /&gt;
&lt;br /&gt;
Many LP solvers exist and a quick search for a Python library turned me onto &lt;a href=&quot;https://github.com/coin-or/pulp&quot; target=&quot;_blank&quot;&gt;PuLP&lt;/a&gt;. They already had a Sudoku example, which is what I wanted to do, so I&#39;ve turned that into a Jupyter notebook (getting one of these embedded in a post is something else I&#39;ve been wanting to try) to make it a bit easier to follow.&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/0f64a5e18246d0827ae32af74d02e9d2.js&quot;&gt;&lt;/script&gt;

LP has polynomial time solutions (see &lt;a href=&quot;https://en.wikipedia.org/wiki/Karmarkar%27s_algorithm&quot; target=&quot;_blank&quot;&gt;Karmarkar&#39;s algorithm&lt;/a&gt;), while IP is &lt;a href=&quot;https://en.wikipedia.org/wiki/NP-hardness&quot; target=&quot;_blank&quot;&gt;NP-hard&lt;/a&gt;&amp;nbsp;(see &lt;a href=&quot;https://en.wikipedia.org/wiki/Branch_and_bound&quot; target=&quot;_blank&quot;&gt;branch-and-bound&lt;/a&gt;). With Sudoku being a relatively small problem it doesn&#39;t seem to matter. These puzzles are easily solvable in under 1 second.&lt;br /&gt;
&lt;br /&gt;
The linear programming I remember was graphing several lines and getting a solution where they intersect. Applying the technique to Sudoku is way more fun. And there are obviously other more useful applications like &lt;a href=&quot;https://blog.thinknewfound.com/2018/08/trade-optimization/&quot; target=&quot;_blank&quot;&gt;portfolio optimization&lt;/a&gt;, airline crew scheduling, or vehicle routing.&lt;br /&gt;
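As a tiny code illustration of an LP, here is a toy problem solved with scipy's linprog rather than PuLP (the objective and constraints are made up for the example):

```python
from scipy.optimize import linprog

# toy LP: maximize x + 2y subject to x + y ≤ 4 and x ≤ 3, with x, y ≥ 0
# (linprog minimizes, so we negate the objective coefficients)
res = linprog(c=[-1, -2], A_ub=[[1, 1], [1, 0]], b_ub=[4, 3])
print(res.x, -res.fun)  # optimum at x=0, y=4 with objective value 8
```

This is the "graphing several lines" version of LP in code form: the optimum lands on a vertex of the feasible region.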
&lt;br /&gt;
While working on this I also came across &lt;a href=&quot;https://developers.google.com/optimization/&quot; target=&quot;_blank&quot;&gt;Google Optimization Tools&lt;/a&gt;, which I hadn&#39;t heard of before. I don&#39;t know how popular PuLP is, so this could be another, perhaps better supported, alternative.&lt;br /&gt;
&lt;br /&gt;
UPDATE:&lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;https://www.cvxpy.org/&quot; target=&quot;_blank&quot;&gt;CVXPY&lt;/a&gt; also looks promising as demonstrated in &lt;a href=&quot;https://towardsdatascience.com/optimization-with-python-how-to-make-the-most-amount-of-money-with-the-least-amount-of-risk-1ebebf5b2f29&quot; target=&quot;_blank&quot;&gt;Optimization with Python: How to make the most amount of money with the least amount of risk&lt;/a&gt;.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href=&quot;https://towardsdatascience.com/integer-programming-vs-linear-programming-in-python-f1be5bb4e60e&quot; target=&quot;_blank&quot;&gt;Integer vs Linear Programming in Python&lt;/a&gt;&amp;nbsp;has a nice comparison of ML to linear programming that will help you decide when to use each.&lt;br /&gt;&lt;/div&gt;</description><link>http://www.programmingopiethehokie.com/2018/07/linear-programming.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-6906121265191374031</guid><pubDate>Sun, 13 Aug 2017 21:43:00 +0000</pubDate><atom:updated>2023-07-09T09:50:00.365-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">concurrency</category><category domain="http://www.blogger.com/atom/ns#">parallelism</category><category domain="http://www.blogger.com/atom/ns#">python</category><title>Python 3 asyncio</title><description>A while back I wrote a few posts about asynchronous programming:&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://www.programmingopiethehokie.com/2014/03/asynchronous-non-blocking-io-java-echo.html&quot; target=&quot;_blank&quot;&gt;Asynchronous Non-blocking I/O Java Echo Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.programmingopiethehokie.com/2014/04/fixing-my-asynchronous-non-blocking-io.html&quot; target=&quot;_blank&quot;&gt;Fixing My Asynchronous Non-blocking I/O Callback Hell With Monads&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.programmingopiethehokie.com/2014/10/scala-asynchronous-io-echo-server.html&quot; target=&quot;_blank&quot;&gt;Scala Asynchronous IO Echo Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.programmingopiethehokie.com/2015/12/reactive-actors.html&quot; target=&quot;_blank&quot;&gt;Reactive Actors&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
So when I learned that Python 3.5 has added async and await operations I knew I had to check it out to see how it compares. &lt;a href=&quot;https://docs.python.org/3/library/asyncio.html&quot; target=&quot;_blank&quot;&gt;asyncio&lt;/a&gt; describes itself as &lt;i&gt;&quot;infrastructure for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives&quot;&lt;/i&gt;.
&lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;https://en.wikipedia.org/wiki/Coroutine&quot; target=&quot;_blank&quot;&gt;Coroutines&lt;/a&gt; in Python are similar to generators, but coroutines (a function definition using &lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;async def&lt;/span&gt;) can control where execution continues after the &lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;yield&lt;/span&gt; (replaced by &lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;await&lt;/span&gt;). You await another coroutine or a future. The coroutine approach can replace callbacks.&lt;br /&gt;
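Since the gists don't render in the feed, here is a minimal standalone sketch of the same idea, written with the newer asyncio.run API rather than the event-loop boilerplate of the era (the sleep durations are illustrative):

```python
import asyncio
import time

# a coroutine that simulates a slow I/O-bound operation
async def slow(seconds):
    await asyncio.sleep(seconds)
    return seconds

async def main():
    # run all three concurrently; control switches at each await point
    return await asyncio.gather(slow(1), slow(2), slow(3))

start = time.monotonic()
results = asyncio.run(main())
elapsed = time.monotonic() - start
print(results, round(elapsed))  # takes about 3 seconds total, not 6
```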
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/4551dd63ab0043749ae2a7977f4842d7.js&quot;&gt;&lt;/script&gt;

If we check the time it takes for the program to run, it&#39;s 3 seconds (the longest slow operation) and not 6 seconds (the sum of all the slow operations).
&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;div class=&quot;p1&quot;&gt;Slow operation sleep 1 complete&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;span class=&quot;s1&quot;&gt;Slow operation sleep 2 complete&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;span class=&quot;s1&quot;&gt;Slow operation sleep 3 complete&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;span class=&quot;s1&quot;&gt;Completed in 3.02 seconds&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;span class=&quot;s1&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;span class=&quot;s1&quot;&gt;It&#39;s similar if we want to get results from each task. We&#39;re even able to get them as they are available.&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;span class=&quot;s1&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;span class=&quot;s1&quot;&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/f2b57a9c42d4136178ef396bddf76d88.js&quot;&gt;&lt;/script&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Slow operation sleep 1 complete&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Got result 1&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Slow operation sleep 2 complete&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Got result 2&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Slow operation sleep 3 complete&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Got result 3&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Completed in 3.00 seconds&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;span class=&quot;s1&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;span class=&quot;s1&quot;&gt;With Python&#39;s &lt;a href=&quot;https://wiki.python.org/moin/GlobalInterpreterLock&quot; target=&quot;_blank&quot;&gt;GIL&lt;/a&gt; you&#39;ve never really been able to run multiple threads in parallel. You&#39;ve had to run concurrent code in multiple processes to leverage multiple CPU cores. 
With asyncio we are able to at least make single-process IO-bound tasks execute faster, because the event loop switches between tasks while they wait on I/O, sidestepping GIL contention.&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;span class=&quot;s1&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;span class=&quot;s1&quot;&gt;For CPU-bound code you still have to use multiple processes to parallelize your code. Python&#39;s &lt;a href=&quot;https://docs.python.org/3/library/multiprocessing.html&quot; target=&quot;_blank&quot;&gt;parallel API&lt;/a&gt; limits your ability to use results from these tasks as they become available, as we did in the example above. Waiting for a process to finish and getting a future result both block. asyncio gives us a way to unify the concurrent and parallel APIs.&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;span class=&quot;s1&quot;&gt;&lt;div class=&quot;p1&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/f625f61f8f2fb8bb00e8db7f365e8e83.js&quot;&gt;&lt;/script&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;div class=&quot;p1&quot;&gt;&lt;div class=&quot;p1&quot;&gt;Slow operation sum 10000000 complete&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Got result 49999995000000&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Slow operation sum 20000000 complete&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Got result 199999990000000&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Slow operation sum 30000000 complete&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Got result 449999985000000&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Completed in 2.91 seconds&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;There is some overhead in creating the processes, but running the 1-, 2-, and 3-second tasks in parallel is only slightly slower than a single 3-second task.&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;div class=&quot;p1&quot;&gt;Slow operation sum 30000000 complete&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Got result 449999985000000&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Completed in 2.66 seconds&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;This hints at another use. Because asyncio runs in a single thread, too many IO-bound tasks, or any task that consumes too much CPU, can overwhelm it. For such a situation it is possible to create multiple processes, each with its own asyncio event loop.&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;script src=&quot;https://gist.github.com/opiethehokie/560e9d755473fcc1fd5524aa930d8353.js&quot;&gt;&lt;/script&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;1 process and 3 tasks:&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;div class=&quot;p1&quot;&gt;Got 3 results in 6.20 seconds&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;1 process and 15 tasks:&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Got 15 results in 23.94 seconds&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;The simulated mix of IO-bound and CPU-bound code is interesting. It&#39;s slower than just the CPU-bound code on its own, but faster than the sum of the IO-bound and CPU-bound code. 
We see some asyncio benefit up to around 15 tasks, and then it levels off.&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;1 process and 30 tasks:&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Got 30 results in 46.58 seconds&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;4 processes and 15*4 tasks:&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Got 60 results in 34.11 seconds&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;8 processes and 15*8 tasks:&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Got 120 results in 52.81 seconds&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;We similarly see some benefit from parallelizing the code up to the number of CPU cores I have; beyond that the time increases linearly, as we would expect.&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;div class=&quot;p1&quot;&gt;16 processes and 15*16 tasks:&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;Got 120 results in 105.98 seconds&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;If you are looking for more details, the comprehensive&amp;nbsp;&lt;a href=&quot;https://qtalen.medium.com/list/python-concurrency-2c979347da3b&quot; target=&quot;_blank&quot;&gt;Python Concurrency&lt;/a&gt; series of articles has additional examples like this and goes into more detail on some of the underlying concepts.&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;p1&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
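To make the unified pattern concrete, here is a rough, self-contained sketch of it. It uses a ThreadPoolExecutor so it runs anywhere as-is; a concurrent.futures ProcessPoolExecutor drops into the same spot for true CPU parallelism (the gists above show the actual code from this post):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def slow_sum(n):
    # stands in for the CPU-bound work; with a ProcessPoolExecutor this
    # would run on another core instead of contending for the GIL
    return sum(range(n))

async def run_sums():
    loop = asyncio.get_running_loop()
    # swap in concurrent.futures.ProcessPoolExecutor for true parallelism
    with ThreadPoolExecutor() as pool:
        futures = [loop.run_in_executor(pool, slow_sum, n)
                   for n in (10_000_000, 20_000_000, 30_000_000)]
        results = []
        # consume each result as its task finishes, like the coroutine examples
        for future in asyncio.as_completed(futures):
            result = await future
            print('Got result', result)
            results.append(result)
        return results

if __name__ == '__main__':
    asyncio.run(run_sums())
```

run_in_executor wraps each pool future in an asyncio future, which is why the same asyncio.as_completed loop works for both coroutines and executor tasks.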
</description><link>http://www.programmingopiethehokie.com/2017/08/python-3-asyncio.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-7320208392506590703</guid><pubDate>Thu, 23 Feb 2017 13:01:00 +0000</pubDate><atom:updated>2023-07-23T15:27:02.535-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">linear algebra</category><category domain="http://www.blogger.com/atom/ns#">optimization</category><category domain="http://www.blogger.com/atom/ns#">python</category><title>Vectorization and Eigenvectors: Sports Rating Examples</title><description>While calculating team ratings for a machine learning-based March Madness prediction,&amp;nbsp;I ran into a couple of situations where my code got slow as I expanded it to include all the teams over all the seasons. By slow I mean several hours, and that was longer than I was willing to wait. I needed to be able to recalculate ratings on-demand, in a few minutes at most.&lt;br /&gt;
&lt;br /&gt;
For my offensive-defensive rating, I started with an implementation similar to the one in&amp;nbsp;&lt;a href=&quot;http://biasvariance.blogspot.com/2015/07/blog-post.html&quot;&gt;Offensive and defensive team ratings for the Premier League 2014-2015&lt;/a&gt;:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/c554bf6750b6cff66b55d69680c0a8a8.js&quot;&gt;&lt;/script&gt;

It loops over all the rows and columns in the matrix many times. With college basketball having over three hundred teams, this wasn&#39;t going to work for me. I figured out that it could be &lt;a href=&quot;http://www.programmingopiethehokie.com/2014/08/exploring-vectorization.html&quot;&gt;vectorized&lt;/a&gt;&amp;nbsp;and that numpy could handle it more efficiently than I could:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/0dd205f73e497d58a333bb94b23e40d8.js&quot;&gt;&lt;/script&gt;

Not only is it faster, but I would argue the more concise code is easier to understand as well.&lt;br /&gt;
&lt;br /&gt;
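As an illustration of the kind of rewrite involved, here is a toy vectorized offense-defense iteration (my own sketch, not the gist above; the score-matrix layout is an assumption):

```python
import numpy as np

def offense_defense(scores, iterations=20):
    # scores[i, j] = points team j scored against team i (hypothetical layout)
    defense = np.ones(scores.shape[0])
    for _ in range(iterations):
        # rate every team's offense against every defense in one matrix product
        offense = scores.T @ (1.0 / defense)
        # then rate every team's defense against the updated offenses
        defense = scores @ (1.0 / offense)
    return offense, defense
```

With a single iteration the offense vector is just each team's column sum of points scored; further iterations let strength of schedule feed back in, with no explicit loops over teams.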
For my Markov Chain ranking, I started with an implementation similar to the one in &lt;a href=&quot;http://biasvariance.blogspot.com/2015/07/premier-league-2014-15-season-rankings.html&quot;&gt;A Markov Chain ranking of Premier League teams (14/15 season)&lt;/a&gt;:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/8d5a4a4782e684a455228d0c2b97b4f7.js&quot;&gt;&lt;/script&gt;

I don&#39;t mean to disparage these two posts in any way. They are awesome and really helped me. I just had a different situation and had to worry about performance, and here I figured out that I could get rid of both loops:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/f57df7b9f08ab5d4158216e20eba59b3.js&quot;&gt;&lt;/script&gt;
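In miniature, the eigenvector shortcut looks something like this for a toy transition matrix (a hypothetical sketch assuming numpy; the gist above has the real code):

```python
import numpy as np

def stationary_distribution(transition):
    # transition is column-stochastic: transition[i, j] = P(state j -> state i)
    values, vectors = np.linalg.eig(transition)
    # the eigenvector paired with the largest (about 1) eigenvalue is stationary
    vector = vectors[:, np.argmax(values.real)].real
    return vector / vector.sum()  # scale so the entries sum to 1
```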

Eigenvectors to the rescue. The eigenvector for the largest eigenvalue is the stationary distribution we are trying to find, so we don&#39;t need to do the 100,000-step random walk.</description><link>http://www.programmingopiethehokie.com/2017/02/machine-learning-for-ncaa-basketball.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-3611918183516956797</guid><pubDate>Tue, 31 May 2016 02:09:00 +0000</pubDate><atom:updated>2023-06-26T07:34:55.307-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">algorithms</category><title>Big-O notation for recursive algorithms</title><description>&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;I once learned of a way to figure out the big-O&lt;sup&gt;*&lt;/sup&gt; for recursive algorithms and in today&#39;s post I just wanted to go back and learn it again. Apparently it&#39;s called the &lt;a href=&quot;https://en.wikipedia.org/wiki/Master_theorem&quot; target=&quot;_blank&quot;&gt;master theorem&lt;/a&gt;.&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;Basically you take the recursive algorithm and let&amp;nbsp;&lt;i&gt;a&lt;/i&gt;&amp;nbsp;be the number of sub-problems in the recursion,&amp;nbsp;&lt;i&gt;b&lt;/i&gt;&amp;nbsp;the factor by which the problem size shrinks (each sub-problem has size&amp;nbsp;&lt;i&gt;n/b&lt;/i&gt;), and&amp;nbsp;&lt;i&gt;f&lt;/i&gt;(&lt;i&gt;n&lt;/i&gt;) the work done outside the recursive calls.&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;Formulate&amp;nbsp;&lt;i&gt;T&lt;/i&gt;(&lt;i&gt;n&lt;/i&gt;) as&amp;nbsp;&lt;i&gt;aT&lt;/i&gt;(&lt;i&gt;n/b&lt;/i&gt;) +&amp;nbsp;&lt;i&gt;f&lt;/i&gt;(&lt;i&gt;n&lt;/i&gt;). &lt;i&gt;T&lt;/i&gt;(&lt;i&gt;n&lt;/i&gt;)&amp;nbsp;can&#39;t be something like sin &lt;i&gt;n&lt;/i&gt; (must be &lt;a href=&quot;https://en.wikipedia.org/wiki/Monotonic_function&quot; target=&quot;_blank&quot;&gt;monotonic&lt;/a&gt;)&amp;nbsp;and&amp;nbsp;&lt;i&gt;f&lt;/i&gt;(&lt;i&gt;n&lt;/i&gt;) must be polynomially bounded.&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: Times, Times New Roman, serif;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Times, Times New Roman, serif;&quot;&gt;If &lt;i&gt;f&lt;/i&gt;(&lt;i&gt;n&lt;/i&gt;) ∈ &lt;i&gt;O&lt;/i&gt;(&lt;i&gt;n&lt;sup&gt;d&lt;/sup&gt;&lt;/i&gt;) then the big-O for &lt;i&gt;T&lt;/i&gt;&amp;nbsp;is:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;i&gt;O&lt;/i&gt;(&lt;i&gt;n&lt;sup&gt;d&lt;/sup&gt;&lt;/i&gt;) if &lt;i&gt;a&lt;/i&gt; &amp;lt; &lt;i&gt;b&lt;sup&gt;d&lt;/sup&gt;&lt;/i&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;i&gt;O&lt;/i&gt;(&lt;i&gt;n&lt;sup&gt;d&lt;/sup&gt;&lt;/i&gt; log &lt;i&gt;n&lt;/i&gt;)&amp;nbsp;if &lt;i&gt;a&lt;/i&gt; =&amp;nbsp;&lt;i&gt;b&lt;sup&gt;d&lt;/sup&gt;&lt;/i&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;i&gt;O&lt;/i&gt;(&lt;i&gt;n&lt;/i&gt;&lt;sup&gt;log&lt;sub style=&quot;font-style: italic;&quot;&gt;b&lt;/sub&gt;&lt;i&gt; a&lt;/i&gt;&lt;/sup&gt;)&amp;nbsp;if &lt;i&gt;a&lt;/i&gt; &amp;gt;&amp;nbsp;&lt;i&gt;b&lt;sup&gt;d&lt;/sup&gt;&lt;/i&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;As an example, let&#39;s apply this to the &lt;a href=&quot;https://en.wikipedia.org/wiki/Merge_sort&quot; target=&quot;_blank&quot;&gt;merge sort&lt;/a&gt;. There are two sub-problems so &lt;i&gt;a&lt;/i&gt; = 2. The size of each sub-problem is half of &lt;i&gt;n&lt;/i&gt;, so &lt;i&gt;b&lt;/i&gt; = 2. Outside of the recursive calls there is one pass through the data, so &lt;i&gt;d&lt;/i&gt; = 1. This leaves us with&amp;nbsp;&lt;i&gt;T&lt;/i&gt;(&lt;i&gt;n&lt;/i&gt;) = 2&lt;i&gt;T&lt;/i&gt;(&lt;i&gt;n/2&lt;/i&gt;) + &lt;i&gt;n&lt;/i&gt;&amp;nbsp;which is &lt;i&gt;O&lt;/i&gt;(&lt;i&gt;n&lt;/i&gt; log &lt;i&gt;n&lt;/i&gt;).&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;For &lt;a href=&quot;https://en.wikipedia.org/wiki/Binary_search_algorithm&quot; target=&quot;_blank&quot;&gt;binary search&lt;/a&gt;, it would be a = 1, b = 2, and d = 0. Therefore&amp;nbsp;&lt;i&gt;O&lt;/i&gt;(log&amp;nbsp;&lt;i&gt;n&lt;/i&gt;).&lt;/span&gt;&lt;br /&gt;
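The three cases are small enough to encode in a helper that reproduces both results (my own sketch):

```python
import math

def master_theorem(a, b, d):
    """Big-O for T(n) = a*T(n/b) + O(n^d), following the three cases above."""
    critical = b ** d
    if a == critical:
        return "O(log n)" if d == 0 else f"O(n^{d} log n)"
    if a > critical:
        exponent = math.log(a, b)  # log base b of a
        return f"O(n^{exponent:g})"
    return f"O(n^{d})"  # remaining case: a is smaller than b**d
```

master_theorem(2, 2, 1) gives the merge sort answer and master_theorem(1, 2, 0) the binary search one.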
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;* Technically I think this should be &lt;a href=&quot;http://stackoverflow.com/questions/471199/what-is-the-difference-between-%CE%98n-and-on&quot; target=&quot;_blank&quot;&gt;big-theta&lt;/a&gt; but informally people only seem to use big-O so I&#39;m sticking with that.&lt;/span&gt;&lt;/div&gt;
&lt;sup&gt;&lt;br /&gt;&lt;/sup&gt;</description><link>http://www.programmingopiethehokie.com/2016/05/big-o-notation-for-recursive-algorithms.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-335003763204455382</guid><pubDate>Fri, 18 Mar 2016 03:27:00 +0000</pubDate><atom:updated>2016-03-17T23:27:12.389-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">C#</category><category domain="http://www.blogger.com/atom/ns#">cloud</category><category domain="http://www.blogger.com/atom/ns#">devops</category><title>Twelve-Factor ASP.NET Core Apps</title><description>I had never done any .NET development until this past year when I started to work with ASP.NET Core. Microsoft says it&#39;s cloud-ready and I agree, but I haven&#39;t seen anyone describe what a &lt;a href=&quot;http://12factor.net/&quot; target=&quot;_blank&quot;&gt;twelve-factor&lt;/a&gt; ASP.NET Core app looks like yet. In the Cloud Foundry (CF) and Heroku communities twelve-factor apps are a popular topic. This is my take on it for ASP.NET Core.&lt;br /&gt;
&lt;br /&gt;
The twelve factors are best practices for anyone who works on a software-as-a-service app. They help your app be portable, continuously deployed, horizontally scaled, and easier to develop. Basically cloud-native. I&#39;m most familiar with the CF platform-as-a-service, so assume that&#39;s the execution environment of the app in my examples.&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Codebase:&lt;/b&gt;&amp;nbsp;The app is tracked in version control.&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;Store each app in its own git repository&lt;/li&gt;
&lt;li&gt;Factor shared code into libraries that become dependencies&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;&lt;b&gt;Dependencies:&lt;/b&gt;&amp;nbsp;The app declares all of its dependencies and never relies on the implicit existence of system-wide libraries or tools.&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;Use NuGet package manager for distribution&lt;/li&gt;
&lt;li&gt;Declare dependencies in project.json&lt;/li&gt;
&lt;li&gt;Dependency isolation is provided by the automatic creation of project.lock.json&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;&lt;b&gt;Config:&lt;/b&gt;&amp;nbsp;Config for anything that varies between deploys should be stored in environment variables.&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;Cloud Foundry provides service credentials in the VCAP_SERVICES environment variable&lt;/li&gt;
&lt;li&gt;Use environment variables for other config like whether the environment is dev or prod, or logging levels&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;&lt;b&gt;Backing services:&lt;/b&gt;&amp;nbsp;There is no distinction between local and remote services; they are all just resources.&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;Based on the config, add services to the DI container in the ConfigureServices method of Startup.cs&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;&lt;b&gt;Build, release, run:&lt;/b&gt;&amp;nbsp;A build transforms the code repository into an executable bundle, release takes the build output and combines it with the config, and run starts the app.&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;I think the build could be a dnu publish or the&amp;nbsp;&lt;a href=&quot;https://devcenter.heroku.com/articles/buildpack-api#bin-compile&quot; target=&quot;_blank&quot;&gt;buildpack compile&lt;/a&gt;&amp;nbsp;phase&lt;/li&gt;
&lt;li&gt;I think the release could be pushing the app to CF or the&amp;nbsp;&lt;a href=&quot;https://devcenter.heroku.com/articles/buildpack-api#bin-release&quot; target=&quot;_blank&quot;&gt;buildpack release&lt;/a&gt;&amp;nbsp;phase&lt;/li&gt;
&lt;li&gt;The run (regardless of how you prefer to think of build and release) is CF starting the app and creating a route&lt;/li&gt;
&lt;li&gt;Provide a rollback mechanism by using &lt;a href=&quot;https://docs.pivotal.io/pivotalcf/devguide/deploy-apps/blue-green.html&quot; target=&quot;_blank&quot;&gt;blue-green deploys&lt;/a&gt; and &lt;a href=&quot;http://martinfowler.com/bliki/CanaryRelease.html&quot; target=&quot;_blank&quot;&gt;canary testing&lt;/a&gt;&amp;nbsp;(there are CF CLI plugins for this)&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;&lt;b&gt;Processes:&lt;/b&gt;&amp;nbsp;The app is stateless and doesn&#39;t rely on sticky sessions.&amp;nbsp;&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;Use a backing service like Redis for HTTP sessions&lt;/li&gt;
&lt;li&gt;Don&#39;t write to the local filesystem because it&#39;s ephemeral and not shared between the multiple instances of the app&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;&lt;b&gt;Port binding:&lt;/b&gt;&amp;nbsp;The app is self-contained. It listens on a port specified by the execution environment.&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;Declare a dependency on Kestrel&lt;/li&gt;
&lt;li&gt;Kestrel can be configured with the --server.urls command-line option to run on a specific host and port&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;&lt;b&gt;Concurrency:&lt;/b&gt;&amp;nbsp;The app scales horizontally because nothing is shared.&amp;nbsp;&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;Manual CF CLI scale and auto-scaling based on different metrics&lt;/li&gt;
&lt;li&gt;CF restarts app on crashes and provides an API to start and stop the app&lt;/li&gt;
&lt;li&gt;Run at least 3 app instances to keep the app running during data center and CF updates&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;&lt;b&gt;Disposability:&lt;/b&gt;&amp;nbsp;The app starts quickly, shuts down gracefully, and handles unexpected non-graceful terminations.&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;When an application is deployed the dependencies including the runtime are cached and re-used to create additional instances of the application&lt;/li&gt;
&lt;li&gt;Implement a &lt;a href=&quot;http://stackoverflow.com/questions/35257287/kestrel-shutdown-function-in-startup-cs-in-asp-net-core&quot;&gt;shutdown handler&lt;/a&gt; that can run when CF stops your app&lt;/li&gt;
&lt;li&gt;Use message queues to communicate between applications and keep track of work for non-graceful terminations&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;&lt;b&gt;Dev/prod parity:&lt;/b&gt;&amp;nbsp;The development, staging, and production environments are as similar as possible.&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;Can do development on Linux to match OS in CF&lt;/li&gt;
&lt;li&gt;Can run ASP.NET Core apps in Docker containers even using the &lt;a href=&quot;https://hub.docker.com/r/cloudfoundry/cflinuxfs2/&quot;&gt;cflinuxfs2&lt;/a&gt; image to get really close to CF, for example when running tests&lt;/li&gt;
&lt;li&gt;Can use Kestrel &lt;a href=&quot;http://stackoverflow.com/questions/33748192/how-do-i-start-kestrel-from-a-test/33771514#33771514&quot;&gt;web server in integration tests&lt;/a&gt; and production&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;&lt;b&gt;Logs:&lt;/b&gt;&amp;nbsp;Logs are a stream of events and can be consumed by a data warehouse, log indexing, or log analysis service.&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;CF apps can log to STDOUT/STDERR and the log messages are buffered in memory where they can be tailed or dumped&lt;/li&gt;
&lt;li&gt;Log messages can be persisted by &lt;a href=&quot;http://docs.cloudfoundry.org/devguide/services/log-management.html#create&quot;&gt;providing a syslog URL&lt;/a&gt; for a log service and they will be drained there&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;&lt;b&gt;Admin processes:&lt;/b&gt;&amp;nbsp;One-off admin processes are executed in the same environment and with the same configuration as the app.&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;Control the app&#39;s start command in a manifest.yml file that is version controlled with the app, like &lt;i&gt;dnx ef database update &amp;amp;&amp;amp; dnx web ...&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;Employ worker apps to run your admin processes and don&#39;t assign a route to them&lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;
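Many of these factors are language-agnostic. Config, for example, boils down to parsing an environment variable no matter the platform: Cloud Foundry hands every app its bound-service credentials as JSON in VCAP_SERVICES. A sketch in Python for brevity (the "redis" label and credential fields are hypothetical, matching whatever services the app was bound to):

```python
import json
import os

def bound_credentials(label):
    # VCAP_SERVICES maps a service label to a list of bound instances,
    # each carrying a credentials dictionary
    services = json.loads(os.environ.get("VCAP_SERVICES", "{}"))
    instances = services.get(label, [])
    return instances[0]["credentials"] if instances else None
```

In ASP.NET Core the same lookup would feed the DI container in ConfigureServices.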
When you dig into these you realize that many of them can actually be handled by a PaaS like Cloud Foundry or by an application&#39;s runtime like ASP.NET Core. That&#39;s good because then the app can focus on the problem it&#39;s trying to solve. The app developers just need to be aware of and know how to leverage what&#39;s provided. These folks should study the twelve-factor site as it has much more detail and examples than included here.&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Check out the &lt;a href=&quot;https://github.com/cloudfoundry-community/asp.net5-buildpack&quot; target=&quot;_blank&quot;&gt;ASP.NET Core Cloud Foundry buildpack&lt;/a&gt;&amp;nbsp;mentioned above (full disclosure: I am a contributor to this).&lt;br /&gt;
&lt;br /&gt;
Something else to keep an eye on is&amp;nbsp;&lt;a href=&quot;https://github.com/SteelToeOSS&quot; target=&quot;_blank&quot;&gt;Steel Toe OSS&lt;/a&gt;. The twelve factors&amp;nbsp;don&#39;t cover some microservice patterns like service discovery, or latency and fault tolerance patterns like circuit breakers and bulkheads. The .NET world seems to currently be far behind Java in this area, so it&#39;s good to see something in the works.&lt;/div&gt;
</description><link>http://www.programmingopiethehokie.com/2016/03/twelve-factor-aspnet-core-apps.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-3541277328341508218</guid><pubDate>Mon, 28 Dec 2015 21:54:00 +0000</pubDate><atom:updated>2023-11-04T10:42:32.255-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">books</category><category domain="http://www.blogger.com/atom/ns#">concurrency</category><category domain="http://www.blogger.com/atom/ns#">coursera</category><category domain="http://www.blogger.com/atom/ns#">scala</category><title>Reactive Actors</title><description>I&#39;ve been meaning to revisit&amp;nbsp;&lt;a href=&quot;https://en.wikipedia.org/wiki/Reactive_programming&quot; target=&quot;_blank&quot;&gt;reactive programming&lt;/a&gt;&amp;nbsp;and the&amp;nbsp;&lt;a href=&quot;https://en.wikipedia.org/wiki/Actor_model&quot; target=&quot;_blank&quot;&gt;actor model&lt;/a&gt;&amp;nbsp;for a while now. I first learned about them in the&amp;nbsp;&lt;a href=&quot;https://www.coursera.org/course/reactive&quot; target=&quot;_blank&quot;&gt;Principles of Reactive Programming&lt;/a&gt;&amp;nbsp;Coursera class and then actors came up again in the&amp;nbsp;&lt;a href=&quot;https://pragprog.com/book/pb7con/seven-concurrency-models-in-seven-weeks&quot; target=&quot;_blank&quot;&gt;Seven Concurrency Models in Seven Weeks&lt;/a&gt;&amp;nbsp;book. The Scala I picked up is quickly being forgotten and I haven&#39;t done a post with code in a while, so here I&#39;ll get back into that and create a simple application using&amp;nbsp;&lt;a href=&quot;http://akka.io/&quot; target=&quot;_blank&quot;&gt;Akka&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href=&quot;http://reactivex.io/rxscala/&quot; target=&quot;_blank&quot;&gt;RxScala&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Actor model&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
The developerWorks article&amp;nbsp;&lt;a href=&quot;https://www.ibm.com/developerworks/library/j-jvmc5/&quot; target=&quot;_blank&quot;&gt;JVM Concurrency: Acting asynchronously with Akka&lt;/a&gt;&amp;nbsp;gives a good introduction to the actor model:&lt;br /&gt;
&lt;blockquote cite=&quot;https://www.ibm.com/developerworks/library/j-jvmc5/&quot;&gt;
The actor model for concurrent computations builds up systems based on primitives called&amp;nbsp;&lt;i&gt;actors&lt;/i&gt;. Actors take actions in response to inputs called&amp;nbsp;&lt;i&gt;messages&lt;/i&gt;. Actions can include changing the actor&#39;s own internal state as well as sending off other messages and even creating other actors. All messages are delivered asynchronously, thereby decoupling message senders from receivers. Because of this decoupling, actor systems are inherently concurrent: Any actors that have input messages available can be executed in parallel, without restriction.&lt;/blockquote&gt;
Then&amp;nbsp;&lt;a href=&quot;http://www.ibm.com/developerworks/opensource/library/j-jvmc6/index.html&quot; target=&quot;_blank&quot;&gt;JVM Concurrency: Building actor applications with Akka&lt;/a&gt;&amp;nbsp;goes on to explain the advantages of this approach:&lt;br /&gt;
&lt;blockquote cite=&quot;http://www.ibm.com/developerworks/opensource/library/j-jvmc6/index.html&quot;&gt;
If you compose your actors and messages correctly, you end up with a system in which most things happen asynchronously. Asynchronous operation is harder to understand than a linear approach, but it pays off in scalability. Highly asynchronous programs are better able to use increased system resources (for example, memory and processors) either to accomplish a particular task more quickly or to handle more instances of the task in parallel. With Akka, you can even extend this scalability across multiple systems, by using remoting to work with distributed actors.&lt;/blockquote&gt;
At first the actor model may sound the same as what I described in my&amp;nbsp;&lt;a href=&quot;http://www.programmingopiethehokie.com/2015/05/communicating-sequential-processes.html&quot; target=&quot;_blank&quot;&gt;Communicating Sequential Processes post&lt;/a&gt;&amp;nbsp;because both involve message passing, but the two concurrency models have several differences:&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;Actors have identities while CSPs are anonymous&lt;/li&gt;
&lt;li&gt;Actors transmit messages to named actors while CSPs transmit messages using channels&lt;/li&gt;
&lt;li&gt;Actors transmit messages asynchronously while CSPs can&#39;t transmit a message until the receiver is ready to accept it&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
My impression, and I could be wrong, is that actors more naturally extend beyond a single machine to a distributed system since the sending and receiving of messages is decoupled. A quick search does turn up&amp;nbsp;&lt;a href=&quot;https://code.google.com/p/pycsp/wiki/Getting_Started_With_Parallel&quot; target=&quot;_blank&quot;&gt;distributed channels in pycsp&lt;/a&gt;, though, so it seems that both can be distributed.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Reactive applications&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
The&amp;nbsp;&lt;a href=&quot;http://www.reactivemanifesto.org/&quot; target=&quot;_blank&quot;&gt;Reactive Manifesto&lt;/a&gt;&amp;nbsp;details four qualities of reactive applications:&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;responsive - the system responds in a timely manner if at all possible&lt;/li&gt;
&lt;li&gt;resilient - the system stays responsive in the face of failure&lt;/li&gt;
&lt;li&gt;elastic - the system stays responsive under varying workloads&lt;/li&gt;
&lt;li&gt;message driven - the system relies on asynchronous message passing between components&lt;/li&gt;
&lt;/ul&gt;
Where Akka describes itself as a toolkit and&amp;nbsp;&lt;i&gt;runtime&lt;/i&gt;, RxScala only claims to be a library for composing asynchronous and event-based programs using observable sequences. To me it&#39;s not clear how it helps us achieve all four qualities (or if it even intends to). &amp;nbsp;Nevertheless, the&amp;nbsp;&lt;a href=&quot;http://reactivex.io/intro.html&quot; target=&quot;_blank&quot;&gt;ReactiveX introduction&lt;/a&gt;&amp;nbsp;explains their advantages:&lt;br /&gt;
&lt;blockquote&gt;
The ReactiveX Observable model allows you to treat streams of asynchronous events with the same sort of simple, composable operations that you use for collections of data items like arrays. It frees you from tangled webs of callbacks, and thereby makes your code more readable and less prone to bugs.&lt;/blockquote&gt;
This means that the methods returning Observables can be implemented using thread pools, non-blocking I/O, actors, or anything else. This is how ReactiveX and Akka will be used together: Actors are the concurrency implementation for services communicating with asynchronous messages.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Combining Akka and RxScala&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
I came up with the following short example. First I wrote a couple of methods returning an Observable to get a feel for it, then added the&amp;nbsp;stockQuote()&amp;nbsp;method, which also uses an actor in its implementation:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/2ffbe732f92c21681b80.js&quot;&gt;&lt;/script&gt;

&lt;br /&gt;
Running it produces the expected output, something like:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;span class=&quot;s1&quot;&gt;&lt;i&gt;6&lt;/i&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;span class=&quot;s1&quot;&gt;&lt;i&gt;8&lt;/i&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;span class=&quot;s1&quot;&gt;&lt;i&gt;broken service&lt;/i&gt;&lt;/span&gt;&lt;/div&gt;
&lt;i&gt;GOOG: 253.22&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
I can really see the potential in the Observable model, especially after reading more about it at&amp;nbsp;&lt;a href=&quot;http://techblog.netflix.com/2013/02/rxjava-netflix-api.html&quot; target=&quot;_blank&quot;&gt;The Netflix Tech Blog&lt;/a&gt;.&amp;nbsp;If you were already using actors maybe combining them like this could make sense. I also need to check out Akka Streams, which seems like a similar idea.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;UPDATE&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href=&quot;https://docs.ray.io/en/latest/ray-overview/index.html&quot; target=&quot;_blank&quot;&gt;Ray&lt;/a&gt; is a framework for parallelizing ML workloads. They use the &lt;a href=&quot;https://docs.ray.io/en/latest/ray-core/actors.html#actor-guide&quot; target=&quot;_blank&quot;&gt;actor model&lt;/a&gt; as a way of coordinating work and maintaining state.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
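Since the update points at Ray's Python actors, the essentials of the model (an identity, private state, and an asynchronous mailbox) are small enough to sketch with just the standard library. A toy, not Akka or Ray:

```python
import queue
import threading

class CounterActor:
    """A toy actor: an identity, private state, and a mailbox."""

    def __init__(self):
        self._mailbox = queue.Queue()  # senders never block on the receiver
        self._count = 0
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message, reply_to=None):
        # asynchronous: returns immediately, the actor processes it later
        self._mailbox.put((message, reply_to))

    def _run(self):
        while True:
            message, reply_to = self._mailbox.get()
            if message == "increment":
                self._count += 1  # only this thread ever touches the state
            elif message == "get" and reply_to is not None:
                reply_to.put(self._count)  # replies go out as messages too

counter = CounterActor()
for _ in range(3):
    counter.send("increment")
replies = queue.Queue()
counter.send("get", reply_to=replies)
print(replies.get(timeout=5))  # 3
```

Because only the mailbox thread mutates the count, there are no locks in sight, which is the property that lets actor systems scale out so naturally.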
</description><link>http://www.programmingopiethehokie.com/2015/12/reactive-actors.html</link><author>noreply@blogger.com (opiethehokie)</author></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1401515658801150010.post-860763885828341144</guid><pubDate>Fri, 04 Sep 2015 22:17:00 +0000</pubDate><atom:updated>2023-07-23T15:28:39.580-04:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">java</category><category domain="http://www.blogger.com/atom/ns#">math</category><title>Bitwise operations 101</title><description>Whenever I see lists of interview questions it seems like bitwise operations are on there. I&#39;ve never used these at work and I don&#39;t even remember needing to do it in school. I wanted to see if I could answer a few of these. After a quick review of what each operation does this is what I came up with.&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/opiethehokie/034335312e2c6a2df5d4.js&quot;&gt;&lt;/script&gt;
The first three methods--nthBitSet, countBits, and isPalindrome--are fairly intuitive and I feel like they could be reasonable interview questions.&lt;br /&gt;
&lt;br /&gt;
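For comparison, here are sketches of those same three ideas in Python (my own take, not the gist above):

```python
def nth_bit_set(x, n):
    # shift the bit of interest into position 0 and mask the rest away
    return (x >> n) & 1 == 1

def count_bits(x):
    count = 0
    while x:
        x &= x - 1  # clears the lowest set bit on every pass
        count += 1
    return count

def is_bit_palindrome(x):
    # build the bits in reverse order, then compare with the original
    original, reversed_bits = x, 0
    while x:
        reversed_bits = (reversed_bits << 1) | (x & 1)
        x >>= 1
    return reversed_bits == original
```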
For adding and multiplying I cheated and looked up the algorithms. They sort of make sense, like the repeated additions for multiplication, but it would have taken a long time to come up with that on my own.</description><link>http://www.programmingopiethehokie.com/2015/09/bitwise-operations-101.html</link><author>noreply@blogger.com (opiethehokie)</author></item></channel></rss>