Aaron Toponce

How do you pronounce "daemon"?

Aaron Toponce — Tue, 24 May 2022 22:32:04 +0000

As a Linux system administrator, I am familiar with the definition of the term "daemon" as a process running on a system, usually backgrounded, that provides some sort of service, like a web server or file store. When I first started my career, I pronounced the "ae" with the long "a" sound as "day-mon". It seems everyone I interacted with agreed with me.

Then I got a new job at a startup, where my boss heard me use the term. He pulled me into his office to correct my behavior:

"Daemon" is pronounced with a long "e", similar to how you pronounce "Caesar". When you see "ae" together, it's always with a long "e". The term itself goes back to Greek mythology. The spelling is the older spelling of "demon" and they are pronounced the same.

This was around 2011. What he said resonated with me and I found other words with "ae" that had the long "e" sound, even if they were spelled archaically:

Archaeology
Caesar
Encyclopaedia
Haemoglobin
Orthopaedic
...

I was a convert. So much so that I was giving a presentation on systemd administration to other system administrators. At one point while discussing service units, I mentioned that "oh, by the way, the word spelled d-a-e-m-o-n is pronounced dee-mon, not day-mon. Think Caesar, encyclopaedia, archaeology, haemoglobin, etc." Well, one attendee in the presentation wasn't satisfied. When I finished and opened the remaining time for questions, he asked:

What do you call a dessert with bananas, ice cream, chocolate syrup, sprinkles, and a cherry?

Of course the answer is a "sundae" and the rest of the attendees got a good chuckle, as I was caught off-guard and didn't have a retort. But that got me thinking: just how many "ae" words exist in English, archaic spelling or modern, and how are they pronounced? An easy regular expression search on my Debian system returns that answer:

$ grep -P "(?\!.*'s$)ae" /usr/share/dict/words

I get 205 results in my search; your mileage may vary. However, when I scan the results, here are some that stand out. Obviously this isn't an exhaustive list, and it reflects how I would pronounce them, which follows General American pronunciation. Received Pronunciation or other English dialects may pronounce them differently. Archaic spellings are italicized:

Long "a"
- aerate
- antennae
- Gaelic
- Praetorian
- pupae
- reggae
- sundae
Long "e"
- aeon
- algae
- archaeology
- Caesar
- encyclopaedia
- haemoglobin
- orthopaedic
Short "e"
- aero-(bic, dynamic, nautics, sol, space)
- aery
- haemorrhage
Short "i"
- caesarean (first "ae")
- Michael
- Rachael
Syllable separator
- caesarean (second "ae")
- Ishmael
- Israel
- Kafkaesque

I think it's fair to say that the pronunciation of "ae" is varied. Is there a single authoritative source we can use that settles the debate? I don't know, but I'm willing to trust Wiktionary on this one:

Etymology 1: A borrowing of Latin daemon ("tutelary deity"), from Ancient Greek δαίμων (daímōn, "dispenser, tutelary deity")
Pronunciation: IPA: /ˈdiː.mən/
Etymology 2: From Maxwell's demon; a derivation from "disk and execution monitor" is generally considered a backronym.
Pronunciation: IPA: /ˈdiːmən/, /ˈdeɪmən/

In the case of "Etymology 1", the pronunciation for our Latin deity for both Received Pronunciation and General American is strongly "dee-mon". However, in the case of "Etymology 2", our pronunciation is both "dee-mon" and "day-mon" according to IPA.

Wikipedia's article on "daemon (computing software)" agrees:

In modern usage, the word daemon is pronounced /ˈdiːmən/ DEE-mən. In the context of computer software, the original pronunciation /ˈdiːmən/ has drifted to /ˈdeɪmən/ DAY-mən for some speakers.

So, "dee-mon" or "day-mon", you're not wrong.

Solving Wordle with Regular Expressions

Aaron Toponce — Tue, 25 Jan 2022 07:01:57 +0000

It seems like everyone is solving the daily Wordle puzzle these days, myself included. If you're not familiar with Wordle, it's a simple, fun, stress-free game. Your goal is to guess the secret 5-letter word within six tries. The rules are really quite simple:

Gray letters are not in the solution.
Yellow letters are in the solution, but in a different position (or positions).
Green letters are in the solution and in the correct position.

That's it. Let's look at the puzzle for January 1, 2022 from first guess to last to see how it works:

Here we solved the puzzle in 4 guesses:

"STARE": "S", "R", and "E" are in the solution, but in different positions. "T" and "A" are not in the solution.
"EUROS": "E", "U", and "R" are in the solution, but in different positions. "O" is not in the solution. "S" is the last character in the solution.
"URGES": "U", "R", and "E" are in the solution, but in different positions. "G" is not in the solution. "S" is the last character in the solution.
"REBUS" is correctly guessed as the solution.

I'm always curious if I can stretch my skills as a system administrator and software developer, and one of those critical skills is regular expressions. While I'm good with the basics, such as line and word anchors, character sets, quantification, and simple back references, I wanted to see if I could dig deeper using the /usr/share/dict/words file on my Debian laptop.

First off, I already know I can get a list of all 5-character alphabetic words ignoring proper nouns with the following regex:

$ grep -P '^[a-z]{5}$' /usr/share/dict/words

This returns 4,681 unique words on my system. I just need to pick a good starting word that increases my chances of guessing the solution quickly. Knowing about character frequency, I can pick a word that maximizes frequently used characters in English. A solid pick would be something that uses "E", "S", "T", "A", "I", "R", etc. As such, I always start with "STARE".

The clues tell us immediately that "S", "R", and "E" are in the solution, just in different positions. We also learn that "T" and "A" are not in the solution. So my limited knowledge of regular expressions would have me doing something like this:

$ grep -P '^[a-z]{5}$' /usr/share/dict/words | grep 's' | grep 'r' | grep 'e' | grep -v '[ta]'

Surely regular expressions can handle this logic without unnecessary pipes. The obvious question is whether or not PCRE supports logical AND. Basically, "match words that have 'S' and 'R' and 'E' and not 'T' and not 'A'". The answer is "yes" using lookarounds.

Regular expression lookarounds come in four flavors:

(?=foo): Look ahead. "foo" immediately follows the current position in the string.
(?<=bar): Look behind. "bar" immediately precedes the current position in the string.
(?!baz): Negative look ahead. What immediately follows the current position in the string is not "baz".
(?<!qux): Negative look behind. What immediately precedes the current position in the string is not "qux".
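
To see the four forms side by side, here is a minimal sketch in Python, whose `re` module accepts the same lookaround syntax as PCRE (with the restriction that look-behind patterns must be fixed-width). The `matches` helper and the sample strings are made up for illustration.

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None

# (?=foo): a digit that is immediately followed by "foo"
look_ahead = (matches(r"\d(?=foo)", "1foo"), matches(r"\d(?=foo)", "1bar"))

# (?<=bar): a digit that is immediately preceded by "bar"
look_behind = (matches(r"(?<=bar)\d", "bar1"), matches(r"(?<=bar)\d", "foo1"))

# (?!baz): a digit that is NOT immediately followed by "baz"
neg_look_ahead = (matches(r"\d(?!baz)", "1foo"), matches(r"\d(?!baz)", "1baz"))

# (?<!qux): a digit that is NOT immediately preceded by "qux"
neg_look_behind = (matches(r"(?<!qux)\d", "bar1"), matches(r"(?<!qux)\d", "qux1"))

for name, result in [("look ahead", look_ahead), ("look behind", look_behind),
                     ("negative look ahead", neg_look_ahead),
                     ("negative look behind", neg_look_behind)]:
    print(name, result)
```

In each pair, the first string satisfies the lookaround and the second does not. None of the lookarounds consume characters; only the digit is part of the match.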

I want "S", and "R" and "E" to appear in the string. So I can use look aheads to make that possible. Because "foo" can be any regular expression pattern, then I can do:

$ grep -P '^(?=.*s.*)(?=.*r.*)(?=.*e.*)[a-z]{5}$' /usr/share/dict/words | grep -v '[ta]'

Further, I know that "T" and "A" are not in the solution, and further I know that "S" will not be in the 1st position, "R" will not be in the 4th, and "E" will not be in the 5th. So, I can make five character classes with negative matches. Our final regular expression looks like:

$ grep -P '^(?=.*s.*)(?=.*r.*)(?=.*e.*)(?=[a-z]{5})[^sta][^ta][^ta][^tar][^tae]$' /usr/share/dict/words

This returns 80 words, but I can fiddle it down a touch further (for the moment) by removing words that have repeated characters. For example, "preps" is returned in that list, but repeats "p". In order to maximize my win ratio, I should probably be taking advantage of words with strictly unique characters. So I'll add the negative look ahead of (?!.*(.).*\1) to my list:

$ grep -P '^(?=.*s.*)(?=.*r.*)(?=.*e.*)(?=[a-z]{5})(?!.*(.).*\1)[^sta][^ta][^ta][^tar][^tae]$' /usr/share/dict/words

This returns 66 words which is easier to parse. I should probably pick something high on the character frequency list. "EUROS" seems like a good pick, so let's try that:

The clues this time tell us that "E", "U", and "R" are in the solution, but in different positions, "O" is not in the solution, and "S" was correctly picked for its position. We can modify our expression above with the new data. We know that for the:

First character: Cannot be "S", "T", "A", "E", or "O".
Second character: Cannot be "T", "A", "U", or "O".
Third character: Cannot be "T", "A", "R", or "O".
Fourth character: Cannot be "T", "A", "R", or "O".
Fifth character: "S".

So we modify our expression:

$ grep -P '^(?=.*r.*)(?=.*e.*)(?=.*u.*)(?=[a-z]{5})(?!.*(.).*\1)[^staeo][^tauo][^taro][^taro]s$' /usr/share/dict/words

We're down to three words. Now we still have that filter that no word can have a repeating character. It might be worthwhile to remove it as our returned list is manageable. Interestingly enough, removing that filter returns the same three words.

$ grep -P '^(?=.*r.*)(?=.*e.*)(?=.*u.*)(?=[a-z]{5})[^staeo][^tauo][^taro][^taro]s$' /usr/share/dict/words

We know the final word is either "GRUES" or "REBUS" at this point, and I have 3 guesses left to win. But let's see if updating the regular expression reveals the solution. Again, given the clues, we know "U", "R", and "E" are in the solution, but in different spots. "G" is not in the solution, and "S" is the final character. Knowing that "GRUES" has "G", then the solution should be "REBUS". Let's see if regex agrees:

$ grep -P '^(?=.*r.*)(?=.*e.*)(?=.*u.*)(?=[a-z]{5})[^staeoug][^tauorg][^tarog][^taroeg]s$' /usr/share/dict/words

Indeed it does. "REBUS" is the only valid word with this dictionary and Wordle confirms it's the correct solution.

Let's look at each piece in this regular expression, just to make sure we know what it's doing:

grep -P: Use Perl Compatible Regular Expressions with GNU grep.
'^: Anchor the match at the beginning of the line.
(?=.*r.*): Match "r" anywhere in the word.
(?=.*e.*): Match "e" anywhere in the word.
(?=.*u.*): Match "u" anywhere in the word.
(?=[a-z]{5}): Match five lowercase alphabetic characters only.
(?!.*(.).*\1): Do not match words with repeated characters.
[^staeoug]: The first character cannot contain "s", "t", "a", "e", "o", "u", or "g".
[^tauorg]: The second character cannot contain "t", "a", "u", "o", "r", or "g".
[^tarog]: The third character cannot contain "t", "a", "r", "o", or "g".
[^taroeg]: The fourth character cannot contain "t", "a", "r", "o", "e", or "g".
s: The fifth character must be "s".
$': Anchor the match at the end of the line.

Yes, it's ugly. Yes, it's probably more efficient to pipe several times. Yes, this regular expression is error-prone and could be difficult to debug. BUT, you learned lookarounds, which may come in handy for pattern matching in the future unrelated to Wordle.

Next step is discovering whether or not you could store the gray characters (not in the solution) in a capture group and refer back to them in the character classes. That would help to reduce typing and error. I'll leave that as an exercise for the reader (and author).
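
As a partial answer to that exercise, one option is to skip capture groups entirely and generate the pattern from the clues. The sketch below does that in Python rather than with grep. The `wordle_regex` function, its "g"/"y"/"-" feedback encoding (green, yellow, gray), and the tiny word list are all assumptions made for illustration, and it ignores the subtleties of guesses that repeat a letter.

```python
import re

def wordle_regex(guesses, unique=False):
    """Build a lookahead-based pattern from (word, feedback) pairs.

    feedback uses one character per letter: "g" green, "y" yellow, "-" gray.
    This encoding is hypothetical; it is not something Wordle itself emits.
    """
    required = set()                      # letters known to be in the solution
    gray = set()                          # letters known to be absent
    green = [None] * 5                    # letters fixed in their position
    banned = [set() for _ in range(5)]    # yellow letters, per position
    for word, feedback in guesses:
        for i, (letter, mark) in enumerate(zip(word, feedback)):
            if mark == "g":
                green[i] = letter
                required.add(letter)
            elif mark == "y":
                banned[i].add(letter)
                required.add(letter)
            else:
                gray.add(letter)
    gray -= required                      # crude handling of repeated letters

    lookaheads = "".join(f"(?=.*{c})" for c in sorted(required))
    no_repeats = r"(?!.*(.).*\1)" if unique else ""
    body = ""
    for i in range(5):
        if green[i]:
            body += green[i]
        else:
            excluded = "".join(sorted(gray | banned[i]))
            body += f"[^{excluded}]" if excluded else "[a-z]"
    return f"^{lookaheads}(?=[a-z]{{5}}){no_repeats}{body}$"

# The three guesses from the walkthrough above.
guesses = [("stare", "y--yy"), ("euros", "yyy-g"), ("urges", "yy-yg")]
sample_words = ["rebus", "grues", "urges", "preps", "stare", "euros"]

pattern = wordle_regex(guesses)
candidates = [w for w in sample_words if re.search(pattern, w)]
print(pattern)
print(candidates)
```

With a real dictionary you would loop over the lines of /usr/share/dict/words instead of the sample list. Passing `unique=True` adds the repeated-letter filter discussed earlier.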

Checksums in Passwords? Uh, okay.

Aaron Toponce — Fri, 09 Apr 2021 14:51:34 +0000

Introduction

As most of my readers know, I have a rather extensive yet easy-to-use web-based password generator. I've spent a lot of time doing password research (a couple ideas mine, most not), and have implemented most of these into the project. These include, but are not limited to:

Expansive language support
Verbal unambiguity
Visual unambiguity
Memorability
Compact density
Programmatic prediction
Versatility
Accommodating complex requirements
Entertainment
Checksums

That last idea was just recently committed to the project, and I think it might have some value, albeit with some tight controls and possibly little reward for the cost.

Bubble Babble

When I started developing my password generator, Tony Arcieri suggested on IRC that I implement Bubble Babble. I think he meant it mostly as a joke, but already being somewhat familiar with it having to do with SSH keys, I looked into it. Bubble Babble is an encoding specification designed for SSH fingerprints. The goal is to make the fingerprints pronounceable, such as when comparing host keys on first use. Antti Huima designed the specification, and built in a checksum to detect transmission errors. But for a password generator, I initially ignored implementing the checksum, and just implemented "xVCVC-CVCVC-...-CVCVC-CVCVx", where "C" is a random consonant from the spec and "V" is a random vowel also from the spec.

But then I got thinking, would it really be that big of a deal to implement Bubble Babble's checksum? Would it impact the character count for similar security margins? Would users even notice or care? It seemed obvious that the answer was to try it and see, so try it I did. For example, here is a Bubble Babble password before I implemented the checksum: "xuduh-taren-rezyd-gefik-bixux", and here's one after implementing the checksum: "xuval-zoder-cykeh-lyvin-lyrax". The former has approximately 78 bits of security, but the check will fail, while the latter has 72 with a valid check. Both are five pronounceable pseudowords.

However, the check isn't easily identifiable, as it's integrated throughout the entire string, and calculating the check is rather involved. One noticeable identifier is if Bubble Babble is encoding an odd number of bytes versus an even number of bytes. If the number of bytes is even, then the format of the string will have a third "x": "xVCVC-CVCVC-...-CVCVC-CVxVx". That doesn't mean that every even-numbered byte-encoded Bubble Babble string that has three "x"s has a valid check, but even numbered bytes with a valid check will have three "x"s. Regardless, stripping off the checksum isn't really a thing, due to being tightly integrated with the full string.

Okay. Now that it's implemented in Bubble Babble, is there any practical value to it? I could only think of one possible scenario.

A Scenario With Tightly Controlled Authentication

Suppose an organization has a credential management system (CMS) that requires employees to use its built-in password manager and password generator. The password generator generates Bubble Babble passwords with valid checksums. When a new employee is hired and their account is set up, the CMS generates a random Bubble Babble password, hashes it, and stores it on disk for authentication. If at any point the employee wants to change their password, the CMS prevents them from supplying their own password, and they must use the built-in generator.

When staff authenticate, all client-side software checks the password for a valid checksum before sending it to the authentication server. If the check fails, the user has entered their password incorrectly, and must try again. If the check succeeds, the software sends the password to the authentication servers for hashing and verification.

The employee could attempt to bypass the client-side check by sending the password directly to the authentication server, but that would be pointless: if verifying the password hash fails at the authentication server, the employee still has to retype their password.

Okay, but why? Well, assuming the organization is using a best practice password hashing function with an appropriate cost factor, then authentication is expensive. People frequently mistype passwords and that cost on the server can be mitigated with client-side checksum validation.

But shouldn't users just copy-paste their passwords from the password manager? Absolutely yes they should. The most secure password is the one you don't know. However, there may be scenarios where pasting the password out of the CMS isn't practical, such as hooking up a crash cart to an unresponsive server or logging into your workstation when first getting into the office.

More Checksums

So if there is value here, are there other places where I've implemented a binary-to-text encoding scheme as a password generator that has a formally defined checksum in its specification? Yes, Crockford's Base32 and Bitcoin's BIP39.

In the case of Crockford's Base32, the checksum extends the base-32 character set to 37 characters, and the checksum is calculated modulo 37 against the bytes. It's rather trivial.
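
To make that concrete, here is a minimal Python sketch of the check symbol. It treats the input bytes as one big-endian integer, which is an assumption on my part (leading zero bytes are lost that way), and the function name `crockford_encode` is made up for this example.

```python
# The 32 encoding symbols (no I, L, O, or U), plus 5 extra check-only symbols.
SYMBOLS = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
CHECK_SYMBOLS = SYMBOLS + "*~$=U"

def crockford_encode(data, checksum=False):
    """Encode bytes as Crockford Base32, optionally appending a check symbol."""
    number = int.from_bytes(data, "big")
    remaining = number
    encoded = ""
    while True:
        remaining, digit = divmod(remaining, 32)
        encoded = SYMBOLS[digit] + encoded
        if remaining == 0:
            break
    if checksum:
        # The check symbol is the whole number modulo 37.
        encoded += CHECK_SYMBOLS[number % 37]
    return encoded

print(crockford_encode(b"\xff"))                 # no check symbol
print(crockford_encode(b"\xff", checksum=True))  # with check symbol
```

Because 37 is prime and larger than 32, the check symbol needs five symbols beyond the encoding alphabet, which is why the character set grows to 37.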

In the case of Bitcoin's BIP39, the bytes (which must be a multiple of 4 bytes) are hashed with SHA-256, and the leading bits of the digest are appended to the original entropy bytes to make the final bitstring a multiple of 11 bits, which is then converted to words and presented as a mnemonic, the final word being the "check word".
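
A minimal Python sketch of that construction follows. It stops at the 11-bit word indices rather than looking up actual words, since the word list is not reproduced here. The function name `bip39_indices` is an assumption for this example.

```python
import hashlib

def bip39_indices(entropy):
    """Return the 11-bit word indices for the given entropy bytes."""
    if len(entropy) == 0 or len(entropy) % 4 != 0:
        raise ValueError("entropy must be a non-empty multiple of 4 bytes")
    entropy_bits = len(entropy) * 8
    checksum_bits = entropy_bits // 32        # 1 check bit per 32 bits of entropy
    digest = hashlib.sha256(entropy).digest()
    # Take the leading bits of the SHA-256 digest as the checksum.
    checksum = int.from_bytes(digest, "big") >> (256 - checksum_bits)
    # Append the checksum to the entropy to form one long bitstring.
    combined = (int.from_bytes(entropy, "big") << checksum_bits) | checksum
    total_bits = entropy_bits + checksum_bits  # always a multiple of 11
    word_count = total_bits // 11
    return [(combined >> (total_bits - 11 * (i + 1))) & 0x7FF
            for i in range(word_count)]

# 128 bits of zeros: 11 words at index 0, then a final "check word".
print(bip39_indices(bytes(16)))
```

For 128 bits of entropy the last index carries 7 bits of entropy and 4 bits of checksum, so only the final word depends on the hash.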

Screenshots

Below are some screenshots of the current state of affairs with the Bitcoin, Bubble Babble, and Base32 generators when at least 70 bits of security is required. The styling and such in each container might change as the project matures, such as "Integrated checksum" in the lower right-hand corner, but the checksum will remain.

Prior Work - Letterblock Diceware

While I did start thinking of this independently on my own, there is prior work that should be acknowledged. I just discovered it last night when doing web searches for password and passphrase generators with checksums. It was partly that discovery actually that led to the creation of this post.

On August 12, 2020 Arne Babenhauserheide created Letterblock Diceware as an approach to physically and practically carrying Diceware with you. Unfortunately, Diceware ships 7,776 words with indices, and at best, this is several pages of printed paper with 5 dice, which isn't practical to carry around. So he created a 6x6 table of "pronounceable" and "memorable" digits, letters, and bigrams that can fit on a business card. Roll 2d6, one for the row and the other for the column, and record the intersection for your password character. Four 2d6 rolls create a "block" worth about 20 bits of security. Four blocks produce about 80 bits of security as a result.

However, as a weak checksum, add the row numbers of two consecutive blocks modulo 4, and insert the resulting character between the two blocks. For example, if you rolled for two blocks as {col, row}:

{2,1}: A
{6,3}: t
{2,1}: A
{3,5}: U
{1,4}: 48
{2,1}: A
{2,4}: FK
{3,3}: N

Then your blocks are "AtAU" and "48AFKN" (or "4AFN" if you prefer). The rows however are "1, 3, 1, 5, 4, 1, 4, & 3". Adding these up modulo 4 returns (1+3+1+5+4+1+4+3) % 4 = 2, which yields "-" for the check. Thus, the resulting password would be "AtAU-4AFN" (I would have done modulo 6 instead. Then the check is uniform, and it could be printed as one more column on the card).

He also mentions the same scenario that I did in this post as well (emphasis mine):

Letterblock passwords use 55 letters that are unambiguous in handwriting and safe to use in URLs, grouped in blocks of four letters to make them easier to remember, with separators that work as weak checksum to catch many typing errors before even sending the password to the server, with weak optimization for legibility by creating 8 passwords and choosing the one with bigrams that are closest to regular prose.

It's worth noting that upstream Diceware also ships tables to be used with dice, although they're not designed for memorability.

Conclusion

There you have it. Three password generators in my web-based password generation project that now ship with checksums without reducing security or reducing the end user experience. I haven't made an actual release yet, as there is a bit more work I want to do prior to that. However, I'm sure there are other scenarios where passwords with checksums have value, as authentication is ubiquitous, and I couldn't possibly list every possible authentication scenario. Play around with it and let me know what you think. I would be very interested in your feedback.

Introducing Deckware - A 224-bit entropy extractor

Aaron Toponce — Fri, 19 Feb 2021 06:13:05 +0000

Introduction

I can't believe that it's been almost 3 years since my last blog post. Interestingly enough, that was on a deterministic card shuffle that I decided to call "Ouroboros". Well, this post is also about a deterministic algorithm with a deck of playing cards, but rather than shuffling the deck, we'll be extracting the entropy out of it.

The algorithm is called Deckware. I would have called it "Pokerware", but it was already taken by Chris Wellons. I could have called it "Solitaireware", and it does have a sort of ring to it, but I didn't want to confuse people with the Solitaire Cipher by Bruce Schneier. I debated calling it "Bridgeware", but I fear that the Bridge card game is a fringe game enjoyed only by old ladies in nursing homes drinking lemonade, and most people wouldn't get it. Ultimately, the randomness extractor is working through the whole deck, so it makes sense to call it "Deckware", even if it does sound a bit like a construction company.

The thing to understand about this algorithm, however, is that it is not a generic passphrase generator like Diceware or Pokerware. As such, there is no word list provided with Deckware. Instead, it's a randomness extractor. It's designed such that you use your deck of playing cards as a random number generator, and this algorithm uniformly returns a 224-bit random number from that shuffle. Once you have those 224 bits of entropy, they're yours to do with as you wish:

Use it as a <= 224-bit cryptographic symmetric key.
Use it as a seed for a CSPRNG, such as reseeding your kernel RNG.
Use it for election auditing, lottery drawing, or randomized drug samples.
Convert the hexadecimal to a 14-word Niceware passphrase.

When you need a lot of randomness, Deckware might work, although it's not particularly fast.

Lehmer Code

The basis of Deckware is Lehmer code. Lehmer code is a factoradic algorithm for converting any specific permutation in a set to an integer. To understand how this works, let's look first at standard combinations that we're all familiar with.

In decimal, which we use every day, we're all familiar with the "ones" place, the "tens" place, "hundreds" place, "thousands" place, etc. So a number like "3481" is 3*1000 + 4*100 + 8*10 + 1*1, right? Simple enough.

Factoradic systems are a way to represent an integer as the sum of multiples of factorials. Instead of a decimal number system (or binary, octal, hexadecimal, etc), it's a factorial number system. If I wanted to take my previous example of "3481", I know that 6! = 720 and 7! = 5040, so 6! is the largest factorial that fits. Thus, 3481/6! = 4 remainder 601. 601/5! is 5 remainder 1. Thus, 3481 = 4*6! + 5*5! + 1*1!.
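
The same conversion can be done mechanically by dividing by 1, 2, 3, and so on, collecting the remainders. Here is a minimal Python sketch; the function names are my own labels for illustration.

```python
from math import factorial

def to_factoradic(number):
    """Return factoradic digits, most significant first (last digit is the 0! place)."""
    digits = []
    radix = 1
    while True:
        number, digit = divmod(number, radix)
        digits.append(digit)       # remainder is the digit for the (radix-1)! place
        radix += 1
        if number == 0:
            break
    return digits[::-1]

def from_factoradic(digits):
    """Sum digit * place! for each position, the inverse of to_factoradic."""
    places = range(len(digits) - 1, -1, -1)
    return sum(d * factorial(p) for d, p in zip(digits, places))

digits = to_factoradic(3481)
print(digits)                      # digits for the 6!, 5!, 4!, 3!, 2!, 1!, 0! places
print(from_factoradic(digits))
```

The digits for 3481 come out as 4, 5, 0, 0, 0, 1, 0, matching 4*6! + 5*5! + 1*1! with zeros in the unused places.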

Okay, but how do you do that with a permutation? Let's say we have a box with numbered chits 1, 2, & 3. How many permutations (order matters) are there? Well, we know it's 3! = 6. We could list them all quite easily:

1, 2, 3
1, 3, 2
2, 1, 3
2, 3, 1
3, 1, 2
3, 2, 1

Lehmer code converts each unique sequence to an integer. It does this by starting with the left-most value, and counting the values less than it to its right. So, starting with the first permutation of "1, 2, 3", "1" is our left-most value, and there are no values to its right that are less than 1. So, for this factorial, its multiplier would be "0". Next we move to the second value, which is "2". Again, there are no values to its right that are less than 2. So for this factorial, its multiplier is also "0". Finally, on the last value, there are no values to its right, so its value is "0". This is always the case for the right-most value in Lehmer code. So for "1, 2, 3", our Lehmer code would be 0*2! + 0*1! + 0*0! = 0. If we look at our second permutation of "1, 3, 2", applying Lehmer code, we get 0*2! + 1*1! + 0*0! = 1.

Let's complete the list:

1, 2, 3 = 0*2! + 0*1! + 0*0! = 0
1, 3, 2 = 0*2! + 1*1! + 0*0! = 1
2, 1, 3 = 1*2! + 0*1! + 0*0! = 2
2, 3, 1 = 1*2! + 1*1! + 0*0! = 3
3, 1, 2 = 2*2! + 0*1! + 0*0! = 4
3, 2, 1 = 2*2! + 1*1! + 0*0! = 5
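
The table can be reproduced with a few lines of Python. This is a straightforward sketch of the counting rule described above, not code from the Deckware tool; `lehmer_rank` is a name I picked for the example.

```python
from itertools import permutations
from math import factorial

def lehmer_rank(sequence):
    """Convert a permutation to its integer index via Lehmer code."""
    size = len(sequence)
    rank = 0
    for i, value in enumerate(sequence):
        # Count the values to the right that are smaller than this one.
        smaller_to_right = sum(1 for other in sequence[i + 1:] if other < value)
        rank += smaller_to_right * factorial(size - 1 - i)
    return rank

for ordering in permutations([1, 2, 3]):
    print(ordering, lehmer_rank(ordering))
```

Every one of the 3! orderings maps to a distinct integer from 0 through 5, with no gaps.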

Deckware uses Lehmer code, but with 52! permutations instead of 3! like our example above.

Playing Card Permutations

Knowing that there are 52 unique cards in a standard Poker or Bridge deck of playing cards, we know there are 52! possible orderings. 52! has 68 decimal digits. Converting to binary bits yields log2(52!) ~= 225.581. In case you forgot, it would take all the energy from a hypernova captured by a Dyson sphere to count from 0 to ~2^227. In all likelihood, the exact order of a sufficiently shuffled deck has never been seen before.
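
Those two figures are easy to check with Python's arbitrary-precision integers. A quick sketch, with variable names of my own choosing:

```python
import math

deck_orderings = math.factorial(52)          # number of ways to order 52 cards
decimal_digits = len(str(deck_orderings))    # how many decimal digits 52! has
bits = math.log2(deck_orderings)             # size of the permutation space in bits

print(decimal_digits)
print(round(bits, 3))
```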

But how do you do the math? How do you compare the inequality of the Ace of Spades to the Ten of Diamonds, for example? To do this, we need to make some numerical assignments. We're going to use Bridge order for suits, and treat Ace as low, King as high. As such, we get:

Clubs: Ace - King = 1 - 13
Diamonds: Ace - King = 14 - 26
Hearts: Ace - King = 27 - 39
Spades: Ace - King = 40 - 52

Now that we have these numerical assignments, we can trivially do our inequality comparisons to build our Lehmer code. But we have a snag. Because the permutation space is larger than 225 bits but not quite 226 bits, we can't use the full space, or we'll end up with a biased extractor. As such, we need to discard anything larger than 2^225-1 (because we start counting with 0). So, when we compute our Lehmer code, if the value is 2^225 or greater, it's ignored, and the user needs to reshuffle the deck. Otherwise, we return the lower 224 bits of the extracted result to the user.

However, 2^225 is approximately 67% of 2^log2(52!). This means that on average, you will have to reshuffle the deck 33% of the time to prevent getting a biased result, or about 1 out of every 3 shuffles will be discarded. It's really unfortunate that it couldn't be better, but it is what it is.
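
Putting the card numbering, the Lehmer code, and the rejection step together gives something like the Python sketch below. It is an illustration of the steps described in this section, not the code behind the actual tool. The card labels such as "AC" and "KS", and the names `CARD_VALUE` and `extract`, are assumptions for the example.

```python
from math import factorial

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["C", "D", "H", "S"]   # Bridge order: clubs, diamonds, hearts, spades
ORDERED_DECK = [rank + suit for suit in SUITS for rank in RANKS]
CARD_VALUE = {card: i + 1 for i, card in enumerate(ORDERED_DECK)}  # 1 through 52
LIMIT = 2 ** 225               # anything at or above this is discarded

def lehmer_rank(values):
    """Convert a permutation of numbers to its integer index."""
    size = len(values)
    rank = 0
    for i, value in enumerate(values):
        smaller_to_right = sum(1 for other in values[i + 1:] if other < value)
        rank += smaller_to_right * factorial(size - 1 - i)
    return rank

def extract(deck):
    """Return 224 bits as 56 hex digits, or None if the deck must be reshuffled."""
    values = [CARD_VALUE[card] for card in deck]
    if sorted(values) != list(range(1, 53)):
        raise ValueError("expected all 52 cards exactly once")
    rank = lehmer_rank(values)
    if rank >= LIMIT:
        return None                           # outside the uniform range
    return format(rank % 2 ** 224, "056x")    # keep only the lower 224 bits

acceptance = LIMIT / factorial(52)            # fraction of shuffles that are kept
print(extract(ORDERED_DECK))
print(round(acceptance, 3))
```

A fully reversed deck has the largest possible Lehmer code, 52! - 1, so it lands in the rejected range and `extract` returns None for it.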

Deckware versus Pokerware: FIGHT!

I think it's worth mentioning how Deckware compares to Pokerware and when you would want to choose one over the other, seeing as though they are both using a deck of playing cards as a source of randomness.

First off, as already mentioned, Deckware does not ship a word list. Technically speaking, Deckware is not a passphrase generator. It's an entropy extractor. This means that you need to bring your own word list to the Deckware table. By comparison, Pokerware provides both formal and slang word lists as part of the project.

Second, Pokerware can be executed trivially without any computing or calculating device. All you need is a deck of cards and a printed off indexed word list. To be fair, I don't think anyone actually keeps a printed off word list of Pokerware, or Diceware for that matter, with them, except for maybe the inventors themselves. I'm guessing most, if not all, are using a computer to generate the Diceware or Pokerware passphrase.

Deckware on the other hand could be executed 100% with a pencil and paper, but it would be painful and incredibly slow. That's something you would make inmates in prison do when they need something to do. I mean, this is essentially what it would take:

Count inequalities for every card placement in the list.
Find the Lehmer code using the factorial number system.
Convert to base 16.

Yeah, no, I'll pass. I'll stick with the tool. However, Deckware has a couple of advantages over Pokerware that might be worth considering.

First with Pokerware, after every draw, the deck needs to be reshuffled. As determined, this is at least 7 shuffles. That's 7 full deck shuffles for every passphrase word. At 6 words, that's a total of 42 shuffles you've performed on the deck. Deckware only requires 7 shuffles, and odds are 2 out of 3 you won't need to try again.

Second, for those 42 shuffles with Pokerware, you only returned 74 bits of security. For Deckware's 7 shuffles, you were able to extract 224 bits! That's a significant return for the cost, making it far more efficient.

In summary:

Pokerware:

Advantages:
Provides two word lists.
Simple and clean to execute.
Can be executed without a computer.
Stands on its own as a unique tool.
Disadvantages:
Cumbersome shuffling per generated word.
More time costly for similar security margins.

Deckware:

Advantages:
Maximizes deck entropy.
Small time commitment.
Can be used for security solutions other than passphrases.
Disadvantages:
Does not provide a word list.
Might be difficult to independently audit.
Requires a computer.
Can be replaced with SHA-224.

I think that last disadvantage actually speaks volumes. In the past, I would shuffle the deck, record the results, and hash it with SHA-224. That's perfectly acceptable, and I won't blame you for that approach. Even though using SHA-224 to hash your deck order is technically biased, the bias isn't significant enough to reduce security in practical terms, and so long as SHA-2 remains secure, you can't identify a biased result from an unbiased one.

Deckware is elegant in that not only is it uniform, it doesn't rely on any cryptographic primitives. It's just factorial math. This means you can trivially audit it for correctness. For example, extract the 224-bit hexadecimal string from an ordered deck, and it should return 0x00000000000000000000000000000000000000000000000000000000. Swap the King of Spades with the Queen of Spades, and it will return 0x00000000000000000000000000000000000000000000000000000001. This sort of vetting isn't accessible for SHA-2, although there is little to no reason to not trust its correctness.

I'm not going to say one is better than the other (Pokerware or Deckware), because as outlined, they have their own strengths and weaknesses. I have personally used Pokerware, and truth be told, I was adding it to my password and passphrase generator (how did I miss it?!), and it got me thinking: "how would I design a playing card algorithm without relying on cryptography?"

Deckware In Action

Here's a couple screenshots of an early release of the tool in action. Here, you can see the unshuffled deck on the "upper table". The suit symbols are emoji provided by OpenMoji. I added the text next to each suit using the DejaVu Serif font in Inkscape.

Here, I've dragged and dropped each card onto the lower table representing the shuffled deck. I'm not very good at JavaScript listening events, so I shamelessly took the code from W3Schools. No doubt it could use some polish, but it works.

Notice that I've clicked the "Calculate unique deck ID" button to extract the entropy (maybe I should change that button text now that I'm thinking about it). I got "b08bd2f0720ade917b842ee1e721fe1c6ad00429e1155f9201b50d82" returned. This gives me a 14-word Niceware passphrase of "random sporran ironclad tare lifeful cromwell trekked wrigglier imprudence amenable thai hajj affectionately barratry".

After extracting the entropy out of the deck, you should thoroughly reshuffle the deck or place it back in order to destroy the key, so the entropy cannot be re-extracted. You should also reload your web browser for the same reason. The tool is not using any persistent storage, but feel free to run the tool in a private browser window if you're paranoid.

Closing Thoughts

In practice, after shuffling the deck, I was able to record every card in the tool in 153 seconds, or around 2 - 3 minutes. That's not bad with drag-and-drop using the mouse, and I'm sure it can be improved with a keyboard listening event to type it in rather than using the mouse. Again though, I'm not proficient with JavaScript listening events, so maybe someone can help me out here. However, whether you use this tool or SHA-224, the bulk of the time is spent recording the cards in the shuffled deck, so from my point of view, it's sixes. Pick your poison. You need a tool, one way or the other.

For the time being, I've got this opened in a tab in my browser. When I need a password generator, I give the hexadecimal string to Niceware. 14 words is generally overkill for my usual password needs. Even dividing it in half, at 7 words each, gives me two 112-bit passphrases. Still overkill. But 5 words yields 80 bits of security, which is right on the money. I can get two 80-bit passphrases and one 64-bit passphrase out of a single shuffle. I'll have to see how this goes.

The Ouroboros Card Shuffle

Aaron Toponce — Fri, 05 Oct 2018 13:42:55 +0000

Introduction

For the most part, I don't play a lot of table games, and I don't play party games. But occasionally, I'll sit down with my family and play a board game or card game. When we play a card game though, I get teased by how I shuffle the deck of cards.

I know that to maximize entropy in the deck, it should be riffle shuffled at least 7 times, but a better security margin would be around 10 shuffles, and after 12, I'm just wasting my time. But I don't always riffle shuffle. I'll also do various deterministic shuffles as well, such as the pile shuffle to separate cards from each other.

I'm familiar with a number of deterministic card shuffles, which I didn't even know had names:

Pile shuffle- Separate the cards into piles, one at a time, until exhausted, then collect the piles. I usually just do 4 or 5 piles, and pick up in order.

Mongean shuffle- Move the cards from one hand to the other, strictly alternating placing each discard on top then beneath the previously discarded cards.

Mexican spiral shuffle- Discard the top card on the table, and the second card to the bottom of the deck in your hand. Continue discarding all odd cards to the table, all even cards beneath the deck in hand until exhausted. I never do this shuffle, because it takes too long to execute.
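Since all three shuffles are deterministic, they are easy to model in code. Here is a minimal Python sketch of how I read the descriptions above, with each deck represented as a list whose first element is the top card. The function names are my own, and the order in which the piles are picked up in the pile shuffle is an assumption.

```python
from collections import deque

def pile_shuffle(deck, piles=4):
    # Deal one card at a time onto each pile in turn; later cards land on top.
    stacks = [[] for _ in range(piles)]
    for i, card in enumerate(deck):
        stacks[i % piles].insert(0, card)
    # Assumption: the piles are picked up in order, with the first pile ending on top.
    return [card for stack in stacks for card in stack]

def mongean_shuffle(deck):
    new_hand = deque()
    for i, card in enumerate(deck):
        if i % 2 == 1:
            new_hand.appendleft(card)  # 2nd, 4th, ... card goes on top
        else:
            new_hand.append(card)      # 1st, 3rd, ... card goes beneath
    return list(new_hand)

def mexican_spiral_shuffle(deck):
    hand = deque(deck)
    table = deque()
    while hand:
        table.appendleft(hand.popleft())  # odd card: discard onto the table pile
        if hand:
            hand.append(hand.popleft())   # even card: move beneath the deck in hand
    return list(table)

print(mongean_shuffle([1, 2, 3, 4, 5, 6]))     # [6, 4, 2, 1, 3, 5]
print(mexican_spiral_shuffle([1, 2, 3, 4]))    # [4, 2, 3, 1]
```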

In practice, when I'm playing a card game with my family, I'll do something like 3 riffle shuffles, a pile shuffle, 3 more riffle shuffles, a Mongean shuffle, 3 more riffle shuffles, another pile shuffle, then one last riffle shuffle. I'll get teased about it, of course: "Dad, I'm sure the cards are shuffled just fine. Can we just play now?", but when we play the game, I'll never hear complaints about how poorly the deck was shuffled.

This got me thinking though- there aren't that many simple deterministic blind card shuffles (I say "blind", because any of the playing card ciphers would work, but that requires seeing the cards, which is generally frowned upon when playing competitive card games). I wonder what else is out there. Well, doing some web searches didn't turn out much. In fact, all I could find were variations of the above shuffles, such as random pile discarding and pickup, but nothing new.

So the question then turned into- could I create my own simple deterministic card shuffle? It didn't take me long before I came up with what I call the "Ouroboros shuffle".

The Ouroboros Shuffle

Before going any further, let me state that I very much doubt I'm the first to come up with this idea, but I have searched and couldn't find where anyone else had documented it. If it does in fact exist, let me know, and I'll gladly give credit where credit is due. Until then, however, I'll stick with calling it the "Ouroboros Shuffle", named after the serpent or dragon eating its own tail.

The shuffle is simple:

Holding the deck in your hand, discard the first card from the bottom of the deck to the table.

Discard the top card of the deck to the discard pile on the table.

Repeat steps 1 and 2, strictly alternating bottom and top cards until the deck is exhausted.

If the playing cards are plastic-based, like those from Kem or Copag, then you could "pinch" the top and bottom cards simultaneously, and pull them out of the deck in your hand to the table. If you do this perfectly, you will pinch 2 cards only 26 times. If they're paper-based though, this may or may not work as efficiently due to cards having a tendency to stick together after heavy use.

If the deck was unshuffled as "1, 2, 3, ..., 50, 51, 52", then the first shuffle would look like this:

Step: 0
Unshuffled: 1, 2, 3, ..., 50, 51, 52
Shuffled:
Step: 1
Unshuffled: 1, 2, 3, ..., 49, 50, 51
Shuffled: 52
Step: 2
Unshuffled: 2, 3, 4, ..., 49, 50, 51
Shuffled: 1, 52
Step: 3
Unshuffled: 2, 3, 4, ..., 48, 49, 50
Shuffled: 51, 1, 52
Step: 4
Unshuffled: 3, 4, 5, ..., 48, 49, 50
Shuffled: 2, 51, 1, 52
Step: 5
Unshuffled: 3, 4, 5, ..., 47, 48, 49
Shuffled: 50, 2, 51, 1, 52
Step: 6
Unshuffled: 4, 5, 6, ..., 47, 48, 49
Shuffled: 3, 50, 2, 51, 1, 52
....
Step: 50
Unshuffled: 26, 27
Shuffled: 25, 28, 24, ..., 51, 1, 52
Step: 51
Unshuffled: 26
Shuffled: 27, 25, 28, ..., 51, 1, 52
Step: 52
Unshuffled:
Shuffled: 26, 27, 25, ..., 51, 1, 52

As you can see, the top and bottom cards are always paired together in the unshuffled deck, and discarded as a pair to the shuffled deck. The top and bottom cards could also be thought of as the head and tail of a list, and thus why I called it the Ouroboros shuffle.
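A single round of the shuffle can be sketched with a double-ended queue, which makes the head-and-tail picture fairly literal. This is a minimal illustration rather than the script I used for the graphs; the function name `ouroboros_shuffle` is my own, and the deck is a list whose first element is the top card.

```python
from collections import deque

def ouroboros_shuffle(deck):
    hand = deque(deck)   # index 0 is the top of the deck in hand
    pile = deque()       # discard pile on the table, index 0 is its top
    take_bottom = True   # the bottom card is always discarded first
    while hand:
        card = hand.pop() if take_bottom else hand.popleft()
        pile.appendleft(card)
        take_bottom = not take_bottom
    return list(pile)

shuffled = ouroboros_shuffle(range(1, 53))
print(shuffled[:3], shuffled[-3:])  # [26, 27, 25] [51, 1, 52]
```

The printed ends match the last step of the walkthrough above: 26, 27, 25 on top and 51, 1, 52 on the bottom.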

If you execute this algorithm perfectly from an unshuffled deck, it will take 51 rounds before the deck is restored to its unshuffled state.

Observations

Almost immediately, I noticed a bias. It doesn't matter how many times I execute this algorithm, the bottom card will always remain on the bottom. In the above example, the King of Spades (if assigned the value of "52") will stay at the bottom of the deck, due to the nature of the shuffle of discarding the bottom card first. So I recognized that I would need to cut at least 1 card from the top of the deck to the bottom of the deck before the next round of the shuffle, to ensure the bottom card gets mixed in with the rest of the deck.

Other questions started popping up, specifically:

How many perfect shuffles will it take to restore the deck to an unshuffled state now?

Is there a different bias hidden after cutting the top card to the bottom?

What if I cut 2 cards? 3 cards? 51 cards?

Whelp, time to code up some Python, and see what pops out. What I'm looking for is what the state of the deck looks like after each round. In other words, I want to know which card occupies which positions in the deck. For example, does the Seven of Clubs see all possible 52 positions in the deck? Without the cut, we know that's not possible, because the bottom card stubbornly stays in the bottom position.

Typing up a quick script and graphing with Gnuplot gave me the following images. The image on the left is the Ouroboros shuffle with no cuts, while the image on the right is the Ouroboros shuffle followed by cutting the top card to the bottom of the deck at the end of the round. Click to enlarge.

What you're looking at is the card position in the deck along the X-axis and the card value along the Y-axis. In the left image, where the Ouroboros shuffle is executed without any following cuts, the 52nd card in the deck is always the face value of 52. But in the right image, where the Ouroboros shuffle is followed by cutting one card from the top of the unshuffled deck to the bottom, every card position sees every face value.
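The same question the graphs answer can be asked directly in code: for a given cut, which positions does each card occupy before the deck returns to its starting order? The following is a small sketch under my own naming (`ouroboros_round`, `positions_seen`), with position 0 as the top of the deck. It is not the script that produced the images.

```python
def ouroboros_round(deck, cut):
    # One round: Ouroboros shuffle, then cut `cut` cards from the top to the bottom.
    hand = list(deck)
    pile = []
    while hand:
        pile.insert(0, hand.pop())       # bottom card first
        if hand:
            pile.insert(0, hand.pop(0))  # then the top card
    return pile[cut:] + pile[:cut]

def positions_seen(cut, size=52):
    # Map each card to the set of positions it occupies over one full cycle.
    orig = list(range(1, size + 1))
    seen = {card: set() for card in orig}
    deck = orig
    while True:
        for pos, card in enumerate(deck):
            seen[card].add(pos)
        deck = ouroboros_round(deck, cut)
        if deck == orig:
            return seen

for cut in (0, 1):
    seen = positions_seen(cut)
    fewest = min(len(positions) for positions in seen.values())
    print("cut", cut, "fewest positions seen by any card:", fewest)
```

With no cut, card 52 only ever appears in the last position, which is the bias described above. A cut is unbiased in this sense only if every card sees all 52 positions.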

So what would happen if instead of cutting 1 card off the top to the bottom at the end of each round, I cut 2 cards, 3 cards, etc. all the way to cutting 51 cards off the top to the bottom? Well, more Python scripting, and I generated a total of 52 images showing every possible position a card occupies in the deck until the deck returns to its unshuffled state.

Visualizations of the Ouroboros card shuffle with cuts

Interestingly enough, executing the Ouroboros shuffle followed by cutting 19 cards, leads to a cycle length of 6,090 perfect shuffles before restoring the deck back to its unshuffled state. Awesome! Except, as you can see in the Imgur post above, it's extremely biased.

Every shuffle-and-cut is listed here with its cycle length:

Cut: 0, iter: 51
Cut: 1, iter: 52
Cut: 2, iter: 51
Cut: 3, iter: 272
Cut: 4, iter: 168
Cut: 5, iter: 210
Cut: 6, iter: 217
Cut: 7, iter: 52
Cut: 8, iter: 418
Cut: 9, iter: 52
Cut: 10, iter: 24
Cut: 11, iter: 350
Cut: 12, iter: 387
Cut: 13, iter: 252
Cut: 14, iter: 1020
Cut: 15, iter: 144
Cut: 16, iter: 1972
Cut: 17, iter: 34
Cut: 18, iter: 651
Cut: 19, iter: 6090
Cut: 20, iter: 175
Cut: 21, iter: 90
Cut: 22, iter: 235
Cut: 23, iter: 60
Cut: 24, iter: 2002
Cut: 25, iter: 144
Cut: 26, iter: 12
Cut: 27, iter: 50
Cut: 28, iter: 24
Cut: 29, iter: 10
Cut: 30, iter: 44
Cut: 31, iter: 72
Cut: 32, iter: 297
Cut: 33, iter: 90
Cut: 34, iter: 45
Cut: 35, iter: 132
Cut: 36, iter: 12
Cut: 37, iter: 210
Cut: 38, iter: 207
Cut: 39, iter: 104
Cut: 40, iter: 420
Cut: 41, iter: 348
Cut: 42, iter: 30
Cut: 43, iter: 198
Cut: 44, iter: 35
Cut: 45, iter: 140
Cut: 46, iter: 390
Cut: 47, iter: 246
Cut: 48, iter: 28
Cut: 49, iter: 12
Cut: 50, iter: 36
Cut: 51, iter: 30

The only "shuffle then cut" rounds that are uniform appear to be cutting 1 card, 7 cards, and 9 cards. The other 49 shuffles are biased in one way or another, even if each of them has a different cycle length.

Here's the Python code I used to create the shuffled lists, each into their own comma-separated file:

#!/usr/bin/env python3

def step_1(deck):
    # Ouroboros shuffle: alternately discard the bottom card, then the top
    # card, onto the top of the discard pile until the deck is exhausted.
    tmp = []
    for card in range(26):
        tmp.insert(0, deck[-1])
        deck.pop(-1)
        tmp.insert(0, deck[0])
        deck.pop(0)
    return tmp

def step_2(deck, cut):
    # Cut the given number of cards from the top of the deck to the bottom.
    return deck[cut:] + deck[:cut]

orig = [_ for _ in range(1, 53)]
deck = [_ for _ in range(1, 53)]

for i in range(52):
    # Record the unshuffled deck, then one row per round until it reappears.
    with open("cut_{}.csv".format(i), "w") as f:
        f.write(",".join(map(str, deck)) + "\n")
    deck = step_1(deck)
    deck = step_2(deck, i)
    n = 1
    while deck != orig:
        with open("cut_{}.csv".format(i), "a") as f:
            f.write(",".join(map(str, deck)) + "\n")
        deck = step_1(deck)
        deck = step_2(deck, i)
        n += 1
    print("Cut: {}, iter: {}".format(i, n))

Conclusion

This was a fun shuffle to play with and one that I'll incorporate into my card playing with my family. Now I could do something like: 3 riffle shuffles, Ouroboros shuffle with 1 card cut, 3 riffle shuffles, Mongean shuffle, 3 riffle shuffles, pile shuffle, 1 riffle shuffle, and I will be happy.

Latin Squares, Mathematics, and Cryptography

Aaron Toponce — Thu, 20 Sep 2018 16:11:25 +0000

Introduction

Recently, I've been studying Latin squares and their role in classical cryptography including the one-time pad. Latin squares are NxN squares where no element in a row is duplicated in that same row, and no element in a column is duplicated in that column. The popular Sudoku game is a puzzle that requires building a Latin square.
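The definition translates almost word for word into a check. Here is a minimal Python sketch, with a square represented as a list of rows; the name `is_latin_square` is my own.

```python
def is_latin_square(square):
    n = len(square)
    symbols = set(square[0])
    if len(symbols) != n:
        return False  # an NxN Latin square uses exactly N distinct symbols
    # Every row and every column must contain each symbol exactly once.
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all(set(col) == symbols for col in zip(*square))
    return rows_ok and cols_ok

print(is_latin_square(["ABC", "BCA", "CAB"]))  # True
print(is_latin_square(["ABC", "ABC", "CAB"]))  # False, columns repeat letters
```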

As I delved deeper and deeper into the subject, I realized that there is a rich history here that I would like to introduce you to. Granted, this post is not an expansive nor exhaustive discussion on Latin squares. Rather, it's meant to introduce you to the topic, so you can look into it on your own if this interests you.

In each of the sections below, the "A" and "N" characters are highlighted in the table image to demonstrate that the table is indeed a Latin square. Further, you can click on any table image to enlarge.

Tabula Recta

The Tabula Recta is probably the table most are familiar with, recognizing it as the Vigenère table. However, the table was first used by German author and monk Johannes Trithemius in 1508, in his Trithemius polyalphabetic cipher. This was a good 15 years before Blaise de Vigenère was even born, 45 years before Giovan Battista Bellaso wrote about his cipher using the table in his 1553 book "La cifra del. Sig. Giovan Battista Bellaso", and 78 years before Blaise de Vigenère improved upon Bellaso's cipher.

Today, we know it as either the "tabula recta" or the "Vigenère table". Regardless, each row shifts the alphabet one character to the left, creating a series of 26 Caesar cipher shifts. This property of the shifted alphabets turns out to be a weakness with the Vigenère cipher, in that if a key repeats, we can take advantage of the Caesar shifts to discover the key length, then the key, and finally break the ciphertext.
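To make the "26 Caesar shifts" concrete, here is a small Python sketch that builds the table row by row and uses it for a repeating-key Vigenère encryption. The function names are my own, and the sketch assumes uppercase A-Z input with no spaces or punctuation.

```python
import string

ALPHABET = string.ascii_uppercase

def tabula_recta():
    # Row i is the alphabet shifted i characters to the left.
    return [ALPHABET[i:] + ALPHABET[:i] for i in range(26)]

def vigenere_encrypt(plaintext, key):
    table = tabula_recta()
    out = []
    for i, p in enumerate(plaintext):
        k = key[i % len(key)]  # the key repeats, which is the weakness
        out.append(table[ALPHABET.index(k)][ALPHABET.index(p)])
    return "".join(out)

print(tabula_recta()[1])                          # BCDEFGHIJKLMNOPQRSTUVWXYZA
print(vigenere_encrypt("ATTACKATDAWN", "LEMON"))  # LXFOPVEFRNHR
```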

Jim Sanborn integrated a keyed tabula recta into his Kryptos sculpture in the 2nd and 4th panels. Even though the first 3 passages in the Kryptos sculpture have been cracked, the 4th passage remains a mystery.

Beaufort Table

More than 250 years later, Rear Admiral Sir Francis Beaufort modified the Vigenère cipher by using a reciprocal alphabet and changing the way messages were encrypted. Messages were still encrypted with a repeating key, similar to the Vigenère cipher, but the plaintext character was located in the first column and the key in the first row. The intersection was the ciphertext. This became the Beaufort cipher.

His reasoning for using a different table and changing the enciphering process isn't clear. It may have been as simple as knowing concepts about the Vigenère cipher without knowing the specific details. He may have had other reasons.

One thing to note, however, is that Vigenère-encrypted ciphertexts cannot be decrypted with a Beaufort table and vice versa. Even though the Beaufort cipher suffers from the same cryptanalysis, the Caesar shifts are different, and the calculation if using numbers instead of letters is also different.

The Beaufort table was integrated into a hardware encryption machine called the Hagelin M-209. The M-209 was used by the United States military during WWII and through the Korean War. The machine itself was small and compact, about the size of a lunchbox and weighing only 6 pounds, which was remarkable for the time.

One thing to note is that the Beaufort table has "Z" in the upper-left corner, with the reciprocal alphabet in the first row and first column, as shown in the image above. Any other table that is not exactly as shown above that claims to be the Beaufort table is not correct.

NSA's DIANA Reciprocal Table

Of course, the narcissistic NSA needs their own polyalphabetic table! We can't let everyone else be the only ones who have tables! I'm joking of course, as there is a strong argument for using this reciprocal table rather than the Beaufort.

Everyone is familiar with the one-time pad, a proven theoretically unbreakable cipher if used correctly. There are a few ways in which to use the one-time pad, such as using XOR or modular addition and subtraction. Another approach is to use a lookup table. The biggest problem with the tabula recta is when using the one-time pad by hand, it's easy to look up the wrong row or column and introduce mistakes into the enciphering process.

However, due to the reciprocal properties of the "DIANA" table (don't you love all the NSA codenames?), encryption and decryption are identical, which means it requires only a single column. A key "row" is no longer needed, and the order of plain, key, and cipher letters doesn't matter (Vigenère vs Beaufort) and may even differ for sender and receiver. Just like with Beaufort, this table is incompatible with Vigenère-encrypted ciphertexts. Further, it's also incompatible with Beaufort-encrypted ciphertexts, especially if it's a one-time pad. The Beaufort table shifts the alphabet to the right, while the DIANA table shifts the alphabet to the left. The tabula recta also shifts left.

Let's make one thing clear here- this table was created strictly for ease of use, not for increased security. When using the one-time pad, the key is at least the length of the message, which means it doesn't repeat. So it doesn't matter that the table is 26 Caesar-shifted alphabets. That property won't show itself in one-time pad ciphertexts.
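The reciprocal property is easy to see numerically. The sketch below assumes the DIANA rule can be written as plain + key + cipher = 25 (mod 26) with A=0 through Z=25, which is my own reading of the table and not an official specification. The function names and the sample pad are likewise only illustrative.

```python
import string

ALPHABET = string.ascii_uppercase

def diana(letter, key_letter):
    # Assumed rule: plain + key + cipher is congruent to 25 (mod 26), A=0 ... Z=25.
    total = ALPHABET.index(letter) + ALPHABET.index(key_letter)
    return ALPHABET[(25 - total) % 26]

def diana_crypt(text, pad):
    # The same function both encrypts and decrypts.
    return "".join(diana(t, k) for t, k in zip(text, pad))

pad = "XMCKLQWERTYU"  # illustrative pad only; a real pad must be truly random
ciphertext = diana_crypt("ATTACKATDAWN", pad)
print(ciphertext, diana_crypt(ciphertext, pad))
```

Because the rule is symmetric in all three letters, swapping the roles of plaintext and key gives the same ciphertext, which is why the lookup order doesn't matter.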

E.J. Williams' Balanced Tables

Stepping away from cryptography for a moment, and entering the world of mathematics, and in this case, mathematical models applied to farming, we come across E.J. Williams' balanced tables. Note how the "A" and "N" characters are populated throughout the table compared to what we've seen previously.

The paper is modeling chemical treatments to crops over a span of time, and how to approach the most efficient means of applying those treatments. The effects of the previous treatment, called the "residual effect" is then analyzed. A method based on a "balanced" Latin square is discussed. It is then applied to multiple farming sites and analyzed.

Now, I know what you're thinking- "Let's use this for a cipher table!". Well, if you did, and your key repeated throughout the message, the ciphertext would not exhibit Caesar-shifted characteristics like Vigenère and Beaufort. However, the table is still deterministic, and as such, knowing how the table is built will give cryptanalysts the edge necessary to still break Williams-encrypted ciphertexts.

Michael Damm's Anti-Symmetric Quasigroups of Order 26

Also in the world of mathematics are quasigroups. These are algebraic structures whose operation must be both total and invertible, but not necessarily associative. Michael Damm researched quasigroups as the basis for an integrity checksum, such as calculating the last digit of a credit card number. He didn't just research quasigroups, though, but anti-symmetric quasigroups. A quasigroup is anti-symmetric when "(c*x)*y = (c*y)*x" implies "x = y". Put the other way around, whenever "x != y", then "(c*x)*y != (c*y)*x", which is what lets the checksum catch two adjacent characters being swapped.

Michael Damm, while researching checksums, introduced us to anti-symmetric quasigroups. One property was required, and that was that the main diagonal was "0", or "A" in our case. The Damm algorithm creates a checksum, such that when verifying the check digit, the result places you on the main diagonal, and thus returns "0". Note that any quasigroup can be represented by a Latin square.

Due to the nature of the Damm algorithm as a checksum, this could be used to verify the integrity of a plaintext message before encrypting using a quasigroup of order 26, as shown above. The sender could calculate the checksum of his plaintext message, and append the final character to the plaintext before encrypting. The recipient, after decrypting the message, could then run the same Damm checksum algorithm against the full plaintext message. If the result is "A", the message wasn't modified.
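The mechanics are the same at any order, so here is a Python sketch using the commonly published order-10 table for decimal digits rather than an order-26 table like the one pictured. The function names are my own. Working with letters would only require swapping in a 26x26 anti-symmetric quasigroup with "A" on the main diagonal.

```python
# Commonly published order-10 Damm table: a Latin square with a zero diagonal.
DAMM_TABLE = [
    [0, 3, 1, 7, 5, 9, 8, 6, 4, 2],
    [7, 0, 9, 2, 1, 5, 4, 8, 6, 3],
    [4, 2, 0, 6, 8, 7, 1, 3, 5, 9],
    [1, 7, 5, 0, 9, 8, 3, 4, 2, 6],
    [6, 1, 2, 3, 0, 4, 5, 9, 7, 8],
    [3, 6, 7, 4, 2, 0, 9, 5, 8, 1],
    [5, 8, 6, 9, 7, 2, 0, 1, 3, 4],
    [8, 9, 4, 5, 3, 6, 2, 0, 1, 7],
    [9, 4, 3, 8, 6, 1, 7, 2, 0, 5],
    [2, 5, 8, 1, 4, 3, 6, 7, 9, 0],
]

def damm_interim(digits):
    # Walk the table: the row is the running value, the column is the next digit.
    interim = 0
    for d in digits:
        interim = DAMM_TABLE[interim][int(d)]
    return interim

def damm_check_digit(digits):
    # The check digit is the final interim value, since table[x][x] == 0.
    return str(damm_interim(digits))

def damm_is_valid(digits_with_check):
    # A valid string lands on the main diagonal, so the result is 0.
    return damm_interim(digits_with_check) == 0

print(damm_check_digit("572"))  # 4
print(damm_is_valid("5724"))    # True
print(damm_is_valid("5274"))    # False, two adjacent digits were swapped
```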

Notice in my image above, that while "A" rests along the main diagonal, the rest of the alphabets are randomized, or at least shuffled. It really isn't important how the alphabets are created, so long as they meet the requirements of being an anti-symmetric quasigroup.

Random Tables

Finally, we have randomized Latin squares. These are still Latin squares, such that for any element in a row, it is not duplicated in that row, and for any element in a column, it is not duplicated in that column. Other than that, however, there is no relationship between rows, columns, or elements. Their use is interesting in a few areas.

First, suppose I give you a partially filled Latin square as a "public key", with instructions on how to encrypt with it. I could then use my fully filled Latin square "private key", of which the public key is a subset. Using this private key, with some other algorithm, I could then decrypt your message. It turns out, filling in a partially-filled Latin square is NP-complete, meaning that we don't currently know of any polynomial-time algorithm that can complete the task. As such, this builds a good foundation for public key cryptography, as briefly outlined here.

Further, because of the lack of any structure in a randomized Latin square, aside from the requirements of being a Latin square, these make good candidates for symmetric message authentication code (MAC) designs. For example, a question on the cryptography StackExchange asked if there was any humanly-verifiable way to add message authentication to the one-time pad. The best answer suggested using a circular buffer as a signature, which incorporates the key, the plaintext, modular addition, and the Latin square. By having a randomized Latin square as the foundation for a MAC tag, no structure is present in the authenticated signature itself. Note, the table can still be public.

Steve Gibson incorporated Latin squares into a deterministic password manager. Of course, as with all deterministic password managers, there are some fatal flaws in their design. Further, his approach, while "off the grid", is rather cumbersome in execution. But it is creative, and certainly worth mentioning here as a randomized Latin square.

Conclusion

Latin squares have fascinated mathematicians for centuries, and in this post, we have seen their use in cryptography, mathematical modeling, data integrity, message authentication, and even password generation. This only briefly shows their potential.

Getting Up To 8 Possibilities From A Single Coin Toss

Aaron Toponce — Fri, 10 Aug 2018 12:00:50 +0000

Introduction

Lately, I've been interested in pulling up some classical modes of generating randomness. That is, rather than relying on a computer to generate my random numbers for me, which is all too common and easy these days, I wanted to go offline, and generate random numbers the classical way- coin flips, dice rolls, card shuffling, roulette wheels, bingo ball cages, paper shredding, etc.

In fact, if randomness interests you, I recently secured the r/RNG subreddit, where we discuss everything from random number generators, to hashing functions, to quantum mechanics and chaos to randomness extraction. I invite you to hang out with us.

Anyway, I was a bit bothered that all I could get out of a single coin toss was 2 outcomes- heads or tails. It seems like there just is no way around it. It's the most basic randomness mechanic, yet it seems so limiting. Then I came across 18th century mathematician Georges-Louis Leclerc, Comte de Buffon, who studied a common game of placing bets on tossing a coin onto a tiled floor: whether the coin landed squarely within a tile without crossing any edges, or whether it crossed a tile edge.

Then it hit me- I can extract more entropy out of a coin toss by printing a grid on a piece of paper. So, I put this up on GitHub as a simple specification. So far, I have papers you can print that are letter, ledger, a3, or a4 sizes, and coins for United States, Canada, and the European Union.

The theory

Assume a tile has a length of "l" and is square, and assume a coin has a diameter "d" such that "d < l". In other words, the tile is larger than the coin, and there is a place on the tile where the coin can sit without crossing any edges of the tile.

This means that if the edge of the coin is tangent to the edge of the tile, then we can draw a smaller square with the center of the coin, inside our tile. This smaller square tells us that if the center of the coin lands anywhere inside of that smaller square, the edges of the coin will not cross the edges of the tile.

Now we know the coin diameter, but we would like to know the tile edge length, so we can draw our grid on a paper to toss the coin to. As such, we need to know the ratio of the area of the tile to the area of the smaller inner square drawn by the coin.

We know that the area of the tile with length "l" is:

A(tile) = l^2

The area of the smaller square inside the tile is determined by both the tile length "l" and the coin diameter "d":

A(inner_square) = (l-d)^2

So the ratio of the two is:

P = (l-d)^2/l^2

I use "P" for our variable as this is the probability that the center of the coin lands inside the inner square. We want our outcomes equally likely, so we want the center of the coin to land with 50% probability inside the inner square and 50% probability outside of the inner square.

1/2 = (l-d)^2/l^2

We know the diameter of the coin, so we just need to solve for the tile edge length. This is a simple quadratic equation:

1/2 = (l-d)^2/l^2
l^2 = 2*(l-d)^2
l^2 = 2*(l^2 - 2*l*d + d^2)
l^2 = 2*l^2 - 4*l*d + 2*d^2
0 = l^2 - 4*l*d + 2*d^2

If you remember from your college or high school days, the quadratic formula solution to the quadratic equation is:

x = (-b +/- Sqrt(b^2 - 4*a*c))/(2*a)

Plugging that in, we get:

l = (-(-4*d) +/- Sqrt((4*d)^2 - 4*1*(2*d^2)))/2
l = (4*d +/- Sqrt(16*d^2 - 8*d^2))/2
l = (4*d +/- Sqrt(8*d^2))/2
l = (4*d +/- 2*d*Sqrt(2))/2
l = d*(2 +/- Sqrt(2))

No surprise, we have 2 solutions. So, which one is the correct one? Well, we set the restriction that "d < l" earlier, and we can't break that. So, we can clearly see that:

l = d*(2 - Sqrt(2))
l = d*(Something less than 1)
l < d

So, this equation would mean that our coin diameter "d" is larger than our tile edge "l", which doesn't mean anything to us. As such, the solution to our problem for finding the tile edge length when we know the coin diameter is:

l = d*(2 + Sqrt(2))

Getting 4 outcomes from one coin toss

Now that we have the theory knocked out, let's apply it. I know that a United States penny diameter is 1.905 centimeters. As such, my grid edge length needs to be:

l = 1.905*(2+Sqrt(2))
l ~= 6.504 centimeters
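The quadratic can also be sidestepped by taking the square root of both sides of P = (l-d)^2/l^2, which gives Sqrt(P) = (l-d)/l, and so l = d/(1 - Sqrt(P)) for the solution where the tile is larger than the coin. Here is a short Python sketch of that shortcut; the function name `tile_edge` is my own.

```python
import math

def tile_edge(coin_diameter, p_inside):
    # From P = (l - d)^2 / l^2 with l > d:  Sqrt(P) = (l - d) / l
    return coin_diameter / (1 - math.sqrt(p_inside))

penny = 1.905  # US penny diameter in centimeters
print(round(tile_edge(penny, 0.5), 3))   # 6.504
print(round(penny * (2 + math.sqrt(2)), 3))  # 6.504, same as the derived formula
```

The same function works for any other target probability between 0 and 1, should you want the inner square to be hit more or less often.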

This means then that when I flip my coin onto the paper grid, there is now a 50% chance that the coin will cross a grid line and a 50% chance that it won't. But this result runs orthogonal to whether or not the coin lands on a heads or tails. As such, I have the following uniformly distributed outcomes:

Tails, does not cross an edge = 00.

Tails, crosses any edge(s) = 01.

Heads, does not cross an edge = 10.

Heads, crosses any edge(s) = 11.

If we do a visual check, whenever the coin center lands in the white square of the following image, the edges of the coin will not cross the edge of the square. However, if the coin center lands in the red area, then the edge of the coin will cross an edge or edges.

You can measure the ratios of the area of the red to that of the white to convince yourself they're equal, although careful- my image editor didn't allow me to calculate sub-pixel measurements when building this image.

Getting 8 outcomes from one coin toss

Remarkably, rather than treat the grid as a single dimensional object (does it cross any grid edge or not), I can use the grid as an X/Y plane, and treat crossing an x-axis edge differently than crossing a y-axis edge. However, that only gives me 6 outcomes, and it is possible that the coin will land in a corner, crossing both the x-axis and y-axis simultaneously. So, I need to treat that separately.

Just like we calculated the odds of crossing an edge to be 50%, now I have 4 outcomes:

Does not cross an edge.

Crosses an x-axis edge only.

Crosses a y-axis edge only.

Crosses both the x-axis and y-axis edges (corner).

As such, each outcome needs to be equally likely, or 25%. Thus, my problem now becomes:

1/4 = (l-d)^2/l^2

Just like we solved the quadratic equation for P=50%, we will do the same here. However, I'll leave the step-by-step as an exercise for the reader. Our solution becomes:

l = 2*d

Simple. The length of the grid edge must be twice the diameter of the coin. No square roots or distributed multiplication. Double the diameter of the coin, and we're good to go. However, this has a benefit and a drawback. The benefit is that the grid is more compact (2*d vs ~3.4*d). This reduces your ability to "aim" for a grid edge or not. The drawback is that you will have to make more ambiguous judgment calls on whether or not the coin crosses an edge (75% of the time vs 50% in the previous approach).

So, like the previous approach of getting 4 outcomes, we have a visual check. If the coin center lands anywhere in the white area of the following image, it won't cross an edge. If the coin center lands anywhere in the green area, the coin will cross an x-axis edge. If the coin center lands anywhere in the blue, it will cross a y-axis edge, and if the coin center lands anywhere in the red, it will cross both an x-axis and y-axis edge simultaneously.

As with the previous 4 outcome approach, we can convince ourselves this holds. The area of the white square should equal the area of the blue, as well as equal the area of the green, as well as equal the area of the red.
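If you'd rather not measure pixels, a quick simulation gives the same reassurance. This Python sketch drops the center of the coin uniformly at random inside one tile with edge 2*d and counts which region it lands in. It assumes an idealized toss (no bouncing, no aiming), the function name `toss_regions` is my own, and which grid line you call the "x-axis" edge is just a labeling choice.

```python
import random

def toss_regions(coin_diameter, tosses=200_000, seed=1):
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    edge = 2 * coin_diameter    # l = 2*d
    r = coin_diameter / 2
    counts = {"none": 0, "x only": 0, "y only": 0, "both": 0}
    for _ in range(tosses):
        cx, cy = rng.uniform(0, edge), rng.uniform(0, edge)
        # The coin overlaps a grid line when its center is within one radius of it.
        crosses_x = cy < r or cy > edge - r
        crosses_y = cx < r or cx > edge - r
        if crosses_x and crosses_y:
            counts["both"] += 1
        elif crosses_x:
            counts["x only"] += 1
        elif crosses_y:
            counts["y only"] += 1
        else:
            counts["none"] += 1
    return {region: count / tosses for region, count in counts.items()}

for region, share in toss_regions(1.905).items():
    print(region, round(share, 3))
```

Each of the four shares should come out close to 0.25, matching the four equal areas in the image.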

This means now we have 8 uniformly distributed outcomes:

Tails, does not cross an edge = 000

Tails, crosses the x-axis only = 001

Tails, crosses the y-axis only = 010

Tails, crosses both axes (corner) = 011

Heads, does not cross an edge = 100

Heads, crosses the x-axis only = 101

Heads, crosses the y-axis only = 110

Heads, crosses both axes (corner) = 111

How do you like them apples? 8 equally likely outcomes from a single coin toss.

As mentioned, one big advantage, aside from getting 3-bits per coin toss, is the smaller grid printed on paper. You have less opportunity to try to cheat the system and influence the results, but you also may have a more difficult time deciding if the coin crossed a grid edge.

On the left is the paper grid used for 2-bit coin toss extraction, while the paper grid on the right is used for 3-bit coin toss extraction.

Some thoughts

Both fortunately and unfortunately, the coin will bounce when it lands on the paper. If the bounce causes the coin to bounce off the paper, all you can record is a single bit, or one of two outcomes- heads or tails, because it becomes ambiguous whether or not the coin would have crossed a grid edge, and which edge it would have crossed.

You could probably minimize the bouncing by putting the paper on something less hard than a table, such as a felt card table, a towel or rag, or even a carpeted floor. But this may also create unwanted bias in the result. For example, if the paper is placed on something too soft, such as a carpeted floor, when the coin hits the paper, it could create damage to the paper, thus possibly influencing future flips, and as a result, introducing bias into the system.

Surrounding the paper with "walls", such that the coin cannot pass the boundaries of the grid may work, but the coin bouncing off the walls will also impact the outcomes. It seems clear that the walls would need to be placed immediately against the outer edge of the grid to prevent from introducing any bias.

Additional approaches

This is hardly conclusive on extracting extra randomness from a coin flip. It is known that coins precess when flying through the air, and as part of that precession, the coin may be spinning like a wheel. Which means that the coin could be "facing north" or "facing south" in addition to heads or tails. However, it's not clear to me how much spin exists in a standard flip, and if this could be a reliable source of randomness.

Further, in our 8-outcome approach, we used the center of the coin as the basis for our extra 2 bits, in addition to heads and tails. However, we could have made the grid edge length "4d" instead of "2d", and ignored the toss if the coin crosses both edges. This means a larger grid, which could improve your chances of aiming the coin in an attempt to skew the results, and also means sticking with smaller coins, as larger coins just won't have many grids that can fit on a paper.

Other ideas could be adding color to each grid. So not only do we identify edge crossing, but use color as a third dimension to get up to 4-bits of randomness. So maybe the x-axes have alternating black and white areas, as do the y-axes and the corners. The center of the grid could possibly be alternating red/blue quadrants. Where the center of the coin lands determines the extracted bits. Of course, this would be a visually busy, and possibly confusing, paper.

None of these have been investigated, but I think each could be interesting approaches to how to extract more bits out of a coin toss.

Conclusion

I think this is a refreshing approach to an age-old problem- the coin toss. Extracting 3-bits from a single flip is extracting more entropy than a fair d6 die can produce (~2.58 bits). This means that practically speaking, the coin is more efficient at entropy extraction than dice. However, you can roll multiple dice simultaneously, where it's more difficult to toss multiple coins simultaneously.

Acknowledgments

Thanks to Dr. Markku-Juhani O. Saarinen, Marsh Ray, and JV Roig for the discussions we had on Twitter, and for helping me flesh out the ideas.

Middle Square Weyl Sequence PRNG

Aaron Toponce — Mon, 30 Jul 2018 12:00:35 +0000

Introduction

The very first software algorithm to generate random numbers was supposedly written in 1946 by John von Neumann. It's called the Middle Square Method, and it's crazy simple. Enough so, you could execute it with a pencil, paper, and basic calculator. In this post, I'm going to cover the method, its drawbacks, and an approach called the Weyl Sequence.

Middle Square Method

The algorithm is to start with an n-digit seed. The seed is squared, producing a 2n-digit result, zero-padded as necessary. The middle n-digits are then extracted from the result for the next seed. See? Simple. Let's look at an example.

Suppose my seed is 81 (2 digits). 81-squared is 6561 (4 digits). We then take the middle 2 digits out of the result, which is 56. We continue the process:

81² = 6561
56² = 3136
13² = 0169
16² = 0256
25² = 0625
62² = 3844
84² = 7056
5² = 0025
2² = 0004
0² = 0000

And we've reached the core problem with the middle square method: it has a tendency to converge, most likely to zero, although other fixed points and short loops are possible. Of course, John von Neumann was aware of this problem, but he also preferred it that way. When the middle square method fails, it's immediately noticeable. But it's also horribly biased and fails most statistical tests for randomness.
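The pencil-and-paper procedure above is short enough to sketch in Python. This is only an illustration, assuming an even number of digits; the function names "middle_square" and "run" are mine and not part of any library.

```python
def middle_square(seed, n_digits):
    """Return the middle n_digits of seed squared (zero-padded to 2n digits)."""
    squared = str(seed * seed).zfill(2 * n_digits)
    start = n_digits // 2
    return int(squared[start:start + n_digits])


def run(seed, n_digits, limit=20):
    """Iterate until a value repeats (converged or looping) or limit is hit."""
    seen = []
    while seed not in seen and len(seen) < limit:
        seen.append(seed)
        seed = middle_square(seed, n_digits)
    return seen


print(run(81, 2))  # [81, 56, 13, 16, 25, 62, 84, 5, 2, 0]
```

Starting from 81, the run collapses to zero after nine steps, matching the sequence worked out by hand.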

Middle Square Weyl Sequence

A modern approach to an old problem is known as the Middle Square Weyl Sequence, from Hermann Weyl. Basically, a number is added to the square, then the middle bits are extracted from the result for the next seed. Let's first look at the C code, then I'll explain it in detail.

#include <stdint.h>

uint64_t x = 0, w = 0;

// Must be odd (least significant bit is "1"), and upper 32-bits non-zero
uint64_t s = 0xb5ad4eceda1ce2a9; // qualifying seed

// return 32-bit number
inline static uint32_t msws() {
    x *= x; // square the number
    w += s; // the weyl sequence
    x += w; // apply to x
    return x = (x>>32) | (x<<32); // return the middle 32-bits
}

Explanation

Okay. Let's dive into the code. This is a 32-bit PRNG using John von Neumann's Middle Square Method, starting with a 64-bit seed "s". As the notes say, it must be an odd number, and the upper 32 bits must be non-zero. It must be odd to ensure that "x" can be both odd and even. Recall: an odd plus an odd equals an even, and an odd plus an even equals an odd.

Note that at the start, "x" is zero, so squaring it is also zero. But that's not a problem, because we are adding a non-zero number. During that time, the "w" variable is assigned. It's dynamically changed on every iteration, although "s" remains static.

Finally, our return is a 32-bit number (because of the "uint32_t" return type of the function), but we're doing some bit-shifting. Supposedly, this is returning the middle 32-bits of our 64-bit "x", but that's not immediately clear. Let's look at it more closely.
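For readers who would rather experiment interactively, here is a rough Python port of the C routine. It is a sketch, not the reference implementation: Python integers are unbounded, so the 64-bit wraparound that C gives for free is emulated with a mask, and the class name "MSWS" and method name "next_u32" are my own.

```python
MASK64 = (1 << 64) - 1  # emulate uint64_t wraparound


class MSWS:
    def __init__(self, s=0xb5ad4eceda1ce2a9):
        self.x = 0
        self.w = 0
        self.s = s  # odd constant taken from the C listing

    def next_u32(self):
        self.x = (self.x * self.x) & MASK64   # square, keep the low 64 bits
        self.w = (self.w + self.s) & MASK64   # advance the Weyl sequence
        self.x = (self.x + self.w) & MASK64   # add it to the square
        self.x = ((self.x >> 32) | (self.x << 32)) & MASK64  # swap halves
        return self.x & 0xFFFFFFFF            # low 32 bits after the swap


rng = MSWS()
first = rng.next_u32()
print(hex(first))  # 0xb5ad4ece
```

On the very first call "x" is zero, so the output is just the upper half of "s"; the squaring only starts to matter from the second call onward.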

Example

Suppose "x = 0xace983fe671dbd09". Then "x" is a 64-bit number with the following bits:

1010110011101001100000111111111001100111000111011011110100001001

When that number is squared, it becomes the 128-bit number 0x74ca9e5f63b6047f6a65456d9da04a51, or in binary:

01110100110010101001111001011111011000111011011000000100011111110110101001100101010001010110110110011101101000000100101001010001

But remember, "x" is a 64-bit number, so in our C code, only the bottom 64-bits are returned from that 128-bit number. So "x" is really 0x6a65456d9da04a51, or in binary:

0110101001100101010001010110110110011101101000000100101001010001

But the bits "01101010011001010100010101101101" are the 3rd 32-bits of the 128-bit number that was the result of squaring "x" (see above). They are the "middle" 32-bits that we're after. So, we're going to do something rather clever. We're going to swap the upper 32-bits with the lower, then return the lower 32-bits.

Effectively, what we're doing is "ABCD" -> "CDAB", then returning "AB". We do this via bit-shifting. So, starting with:

0110101001100101010001010110110110011101101000000100101001010001

First, we bitshift the 64-bit number right 32-bits:

0110101001100101010001010110110110011101101000000100101001010001 >> 32 = 0000000000000000000000000000000001101010011001010100010101101101

Then we bitshift "x" left 32-bits:

0110101001100101010001010110110110011101101000000100101001010001 << 32 = 1001110110100000010010100101000100000000000000000000000000000000

Now we logically "or" them together:

  0000000000000000000000000000000001101010011001010100010101101101
| 1001110110100000010010100101000100000000000000000000000000000000
  ================================================================
  1001110110100000010010100101000101101010011001010100010101101101

See the swap? Now, due to the function return width, we return the lower 32-bits as our random number, which is 01101010011001010100010101101101, or 1785021805 in decimal. We've arrived at our goal.
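The swap is easy to check numerically. The snippet below is a small sketch that starts from the truncated 64-bit value in the example and applies the same shifts; the helper name "swap_halves" is my own.

```python
MASK64 = (1 << 64) - 1


def swap_halves(x):
    """Swap the upper and lower 32 bits of a 64-bit value ("ABCD" -> "CDAB")."""
    return ((x >> 32) | (x << 32)) & MASK64


x = 0x6a65456d9da04a51          # the truncated square from the example
swapped = swap_halves(x)
result = swapped & 0xFFFFFFFF   # what a uint32_t return value keeps

print(hex(swapped))  # 0x9da04a516a65456d
print(result)        # 1785021805
```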

Conclusion

At the main website, the C source code is provided, along with 25,000 seeds, as well as C source code for the Big Crush randomness tests from TestU01. This approach passes Big Crush with flying colors on all 25,000 seeds. Something as simple as adding an odd 64-bit number to the square changes John von Neumann's approach so much that it becomes a notable PRNG.

Who said you can't teach an old dog new tricks?

Why The "Multiply and Floor" RNG Method Is Biased

Aaron Toponce — Wed, 13 Jun 2018 17:30:56 +0000

I've been auditing a lot of JavaScript source code lately, and a common problem I'm seeing when generating random numbers is using the naive "multiply-and-floor" method. Because the "Math.random()" function call returns a number between 0 and 1, not including 1 itself, then developers think that the "best practice" for generating a random number is as follows:

function randNumber(range) {
    return Math.floor(Math.random() * range); // number in the interval [0, range).
}

The problem with this approach is that it's biased. There are numbers returned that are more likely to occur than others. To understand this, you need to understand that Math.random() is a 32-bit RNG in Chrome and Safari, and a 53-bit RNG in Edge and Firefox. First, let's pretend every browser RNG is a 32-bit generator, then we'll extend it.

A 32-bit Math.random() means that there are only 2³² = 4,294,967,296 possible decimal values in the range of [0, 1). This means that the interval [0, 1) is divided up every "1/2³² = 0.00000000023283064365" decimal values. The exact step size doesn't matter, though. What matters is that if I want one of 100 possible numbers, 100 does not divide 4,294,967,296 evenly. I get 42,949,672 with 96 left over. What does this mean? It means that ...

randNumber(100);

... will favor 96 numbers out of our 100. The 4 least likely results are 24, 49, 74, & 99. That's our bias.
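The 32-bit case can be counted exactly rather than sampled. The sketch below assumes an idealized Math.random() that returns k / 2³² for a uniformly chosen integer k, and counts how many values of k land on each output of multiply-and-floor. The function name "bucket_counts" is mine.

```python
def bucket_counts(bits, rng_range):
    """Count how many of the 2**bits equally likely inputs land on each output."""
    total = 1 << bits

    def ceil_div(a, b):
        return -(-a // b)

    # floor(k * rng_range / total) == j exactly when
    # ceil(j * total / rng_range) <= k < ceil((j + 1) * total / rng_range)
    return [ceil_div((j + 1) * total, rng_range) - ceil_div(j * total, rng_range)
            for j in range(rng_range)]


counts = bucket_counts(32, 100)
least = [j for j, c in enumerate(counts) if c == min(counts)]
print(min(counts), max(counts))  # 42949672 42949673
print(least)                     # [24, 49, 74, 99]
```

The difference is a single input per bucket, so the bias is tiny for a range of 100, but it grows as the requested range approaches the size of the generator.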

It doesn't matter if it's a 53-bit RNG either. "2⁵³ = 9,007,199,254,740,992" is not a multiple of 100. Instead, dividing by 100, I get 90,071,992,547,409 with 92 left over. So, with a 53-bit RNG, we have the same problem where 92 results will be more likely to be generated than 8 others. Those unlucky 8 are 11, 22, 33, 45, 58, 66, 79, and 91.

The only time this bias would not exhibit itself in the naive "multiply-and-floor" approach above is if the random number requested is in the interval [0, 2^N), where "N" is any positive integer. 2³², 2⁵³, and in general 2^X, where "X" is a positive integer, are always multiples of 2^N (2^N divides 2^X evenly when N ≤ X, N > 0).

So, what do we do? How do we improve the naive multiply-and-floor approach? Thankfully, it's not too difficult. All we need to do is essentially the following:

1. Force the RNG into 32-bits (common denominator for all browsers).

2. Create a range of values that is a multiple of our desired range (e.g., 1-100).

3. Loop, picking values until one inside that range is generated.

4. Output the generated value modulo our desired range.

Let's see this in practice. First the unbiased code, then the explanation:

function uniformRandNumber(range) {
    var max = Math.floor(2**32/range) * range; // make "max" a multiple of "range"
    do {
        var x = Math.floor(Math.random() * 2**32); // pick a number of [0, 2^32).
    } while(x >= max); // try again if x is too big
    return(x % range); // uniformly picked in [0, range)
}

I know what you're thinking: WAIT! YOU JUST DID THE "MULTIPLY AND FLOOR" METHOD!! HYPOCRITE!!! Hold on though. There are two subtle differences. See what they are?

The "max" variable is a multiple of "range" (step 2 above). So, if our range is [0, 100), then "max = 4294967200", which is a multiple of 100. This means that so long as "0 <= x < 4294967200", we can return "x % 100", and know that our number was uniformly chosen. However, if "x >= 4294967200", then we need to choose a new "x", and check if it falls within our range again (step 3 above). So long as "x" falls in [0, 4294967200), then we're good.
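The same reject-and-retry idea carries over to other languages. Here is a minimal Python sketch, assuming a source of uniform 32-bit integers; by default it uses "secrets.randbits(32)" from the standard library, and the injectable "rand32" parameter is my own addition so the rejection path can be exercised with known inputs.

```python
import secrets


def uniform_rand_number(rng_range, rand32=lambda: secrets.randbits(32)):
    """Uniformly pick an integer in [0, rng_range) from a 32-bit source."""
    if not 0 < rng_range <= 2**32:
        raise ValueError("rng_range must be in (0, 2**32]")
    limit = (2**32 // rng_range) * rng_range  # largest multiple of rng_range
    while True:
        x = rand32()
        if x < limit:          # otherwise reject and draw again
            return x % rng_range


print(uniform_rand_number(100))
```

For a range of 100, only 96 of the 4,294,967,296 possible draws are ever rejected, so the loop almost always finishes on the first pass.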

This extends to cryptographically secure random numbers too. In action, it's just:

function uniformSecureRandNumber(range) {
    const crypto = window.crypto || window.msCrypto; // Microsoft vs everyone else
    var max = Math.floor(2**32/range) * range; // make "max" a multiple of "range"
    do {
        var x = crypto.getRandomValues(new Uint32Array(1))[0]; // pick a number of [0, 2^32).
    } while(x >= max); // try again if x is too big
    return(x % range); // uniformly picked in [0, range)
}

So it's not that "multiply and floor" is wrong, so long as you use it correctly.

One small caveat: these examples do not check whether "range" is larger than 32-bits. I deliberately ignored this to draw your attention to how to correctly generate uniform random numbers. You may or may not need to do various checks on the "range" argument. Is it an integer type? Is it a positive integer? Is it 32-bits or less? Etc.

As an exercise for the reader, how could you extend this uniform generator to pick a random number in the range of [100, 200)? Going further, how could you pick only a random even number in the range of [250, 500)?

Do Not Use sha256crypt / sha512crypt - They're Dangerous

Aaron Toponce — Wed, 23 May 2018 12:56:41 +0000

Introduction

I'd like to demonstrate why I think using sha256crypt or sha512crypt on current GNU/Linux operating systems is dangerous, and why I think the developers of GLIBC should move to scrypt or Argon2, or at least bcrypt or PBKDF2.

History and md5crypt

In 1994, Poul-Henning Kamp (PHK) added md5crypt to FreeBSD to address the weaknesses of DES-crypt that was common on the Unix and BSD systems of the early 1990s. DES-Crypt has a core flaw in that not only is DES reversible (which isn't necessarily a problem here) and incredibly fast, but it also limited password length to 8 characters (each of those limited to 7-bit ASCII to create a 56-bit DES key). When PHK created md5crypt, one of the things he made sure to implement as a feature was support for arbitrary-length passwords. In other words, unlike DES-Crypt, a user could have passwords of 9 or more characters.

This was "good enough" for 1994, but it had an interesting feature that I don't think PHK thought of at the time- md5crypt execution time is dependent on password length. To prove this, I wrote a simple Python script using passlib to hash passwords with md5crypt. I started with a single "a" character as my password, then increased the password length by appending more "a"s up until the password was 4,096 "a"s total.

import time
from passlib.hash import md5_crypt

md5_results = [None] * 4096

for i in range(0, 4096):
    print(i, end=" ")
    pw = "a" * (i+1)
    # time.process_time() measures CPU time; it stands in for the older
    # time.clock(), which no longer exists in current Python 3 releases
    start = time.process_time()
    md5_crypt.hash(pw)
    end = time.process_time()
    md5_results[i] = end - start

with open("md5crypt.txt", "w") as f:
    for i in range(0, 4096):
        f.write("{0} {1}\n".format(i+1, md5_results[i]))

Nothing fancy. Start the timer, hash one "a" with md5crypt, stop the timer, and record the results. Start the timer, hash two "a"s with md5crypt, stop the timer, and record the results. Wash, rinse, repeat, until the password is 4,096 "a"s in length.

What do the timing results look like? Below are scatter plots of timing md5crypt for passwords of 1-128, 1-512, and 1-4,096 characters in length:

md5crypt 1-128 characters

md5crypt 1-512 characters

md5crypt 1-4,096 characters

At first, you wouldn't think this is a big deal; in fact, you may even think you LIKE it (we're supposed to make things get slower, right? That's a good thing, right???). But, upon deeper inspection, this actually is a flaw in the algorithm's design for two reasons:

Long passwords can create a denial-of-service on the CPU (larger concern).

Passive observation of execution times can predict password length (smaller concern).

Now, to be fair, predicting password length based on execution time is ... meh. Let's be honest, the bulk of passwords will be between 7-10 characters. And because these algorithms operate in block sizes of 16, 32, or 64 bytes, an adversary learning "AHA! I know your password is between 1-16 characters" really isn't saying much. But, should this even exist in a cryptographic primitive? Probably not. Still, the larger concern would be users creating a DoS on the CPU, strictly by changing password length.

I know what you're thinking: it's 2018, so there should be no reason why any practical length password cannot be adequately hashed with md5crypt insanely quickly, and you're right. Except, md5crypt was invented in 1994, 24 years ago. According to PHK, he designed it to take about 36 milliseconds on the hardware he was testing, which would mean a speed of about 28 hashes per second. So, it doesn't take much to see that by increasing the password's length, you can increase execution time enough to affect a busy authentication server.

The question though, is why? Why is the execution time dependent on password length? This is because md5crypt processes the hash for every 16 bytes in the password. As a result, this creates the stepping behavior you see in the scatter plots above. A good password hashing design would not do this.

PHK eventually sunset md5crypt in 2012 with CVE-2012-3287. Jeremi Gosney, a professional password cracker, demonstrated with Hashcat and 8 clustered Nvidia GTX 1080Ti GPUs that a password cracker could rip through 128.4 million md5crypt guesses per second.

You should no longer be implementing md5crypt for your password hashing.

sha2crypt and NIH syndrome

In 2007, Ulrich Drepper decided to improve things for GNU/Linux. He recognized the threat that GPU clusters, and even ASICs, posed on fast password cracking with md5crypt. One aspect of md5crypt was the hard-coded 1,000 iterations spent on the CPU, before the password hash was finalized. This cost was not configurable. Also, MD5 was already considered broken, with SHA-1 showing severe weaknesses, so he moved to SHA-2 for the core of his design.

The first thing addressed, was to make the cost configurable, so as hardware improved, you could increase the iteration count, thus keeping the cost for calculating the final hash expensive for password crackers. However, he also made a couple core changes to his design that differed from md5crypt, which ended up having some rather drastic effects on its execution.

Using code similar to above with Python's passlib, but using its sha256_crypt and sha512_crypt hashers instead, we can create scatter plots of sha256crypt and sha512crypt for passwords up to 128-characters, 512-characters, and 4,096-characters total, just like we did with md5crypt. How do they fall out? Take a look:

sha256crypt 1-128 characters

sha256crypt 1-512 characters

sha256crypt 1-4,096 characters

sha512crypt 1-128 characters

sha512crypt 1-512 characters

sha512crypt 1-4,096 characters

Curious. Not only do we see the same increasing execution time based on password length, but unlike md5crypt, that growth is polynomial. The changes Ulrich Drepper made from md5crypt are subtle, but critical. Essentially, not only do we process the hash for every character in the password per round, like md5crypt, but we process every character in the password three more times. First, we take the binary representation of each bit in the password length, and update the hash based on if we see a "1" or a "0". Second, for every character in the password, we update the hash. Finally, again, for every character in the password, we update the hash.

For those familiar with big-O notation, we end up with an execution run time of O(pw_length² + pw_length*iterations). Now, while it is true that we want our password hashing functions to be slow, we also want the iterative cost to be the driving factor in that decision, but that isn't the case with md5crypt, and it's not the case with sha256crypt nor sha512crypt. In all three cases, the password length is the driving factor in the execution time, not the iteration count.

Again, why is this a problem? To remind you:

Long passwords can create a denial-of-service on the CPU (larger concern).

Passive observation of execution times can predict password length (smaller concern).

Now, granted, in practice, people aren't carrying around 4 kilobyte passwords. If you are a web service provider, you probably don't want people uploading 5 gigabyte "passwords" to your service, creating a network denial of service. So you would probably be interested in creating an adequate password maximum, such as what NIST recommends at 128 characters, to prevent that from occurring. However, if you have an adequate iterative cost (such as say, 640,000 rounds), then even moderately large passwords from staff, where such limits may not be imposed, could create a CPU denial of service on busy authentication servers.

As with md5crypt, we don't want this.

Now, here's what I find odd about Ulrich Drepper, and his design. In his post, he says about his specification (emphasis mine):

Well, there is a problem. I can already hear everybody complaining that I suffer from the NIH syndrome but this is not the reason. The same people who object to MD5 make their decisions on what to use also based on NIST guidelines. And Blowfish is not on the lists of the NIST. Therefore bcrypt() does not solve the problem.

What is on the list is AES and the various SHA hash functions. Both are viable options. The AES variant can be based upon bcrypt(), the SHA variant could be based on the MD5 variant currently implemented.

Since I had to solve the problem and I consider both solutions equally secure I went with the one which involves less code. The solution we use is based on SHA. More precisely, on SHA-256 and SHA-512.

PBKDF2 was published by the IETF as RFC 2898 in September 2000, a full 7 years before Ulrich Drepper created his password hashing functions. While PBKDF2 as a whole would not be blessed by NIST until 3 years later, in December 2010 in SP 800-132, PBKDF2 can be based on functions that, as he mentioned, were already in the NIST standards. So, just like his special design that is based on SHA-2, PBKDF2 can be based on SHA-2. Where he said "I went with the one which involves less code", he should have gone with PBKDF2, as code had already long since existed in all sorts of cryptographic software, including OpenSSL.

This seems to be a very clear case of NIH syndrome. Sure, I understand not wanting to go with bcrypt, as it's not part of the NIST standards. But don't roll your own crypto either, when algorithms already exist for this very purpose that ARE based on designs that are part of NIST.

So, how does PBKDF2-HMAC-SHA512 perform? Using similar Python code with the passlib password hashing library, it was trivial to put together:

PBKDF2-HMAC-SHA512 1-128 characters

PBKDF2-HMAC-SHA512 1-512 characters

PBKDF2-HMAC-SHA512 1-4,096 characters

What this clearly demonstrates is that the only factor driving execution time is the number of iterations you apply to the password before delivering the final password hash. This is what you want to achieve: not giving a user the opportunity to create a denial-of-service based on password length, nor letting an adversary learn the length of the user's password based on execution time.
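If you don't have passlib handy, the same kind of measurement can be approximated with the standard library alone. The sketch below is not the script that produced the plots; it uses "hashlib.pbkdf2_hmac" with an iteration count and a fixed salt that I picked purely for illustration, and the helper name "time_pbkdf2" is mine. One plausible reason for the flat timings is that HMAC hashes any key longer than the hash's block size down to a single digest first, so the per-iteration work operates on fixed-size data.

```python
import hashlib
import time


def time_pbkdf2(pw_length, iterations=10_000):
    """Time one PBKDF2-HMAC-SHA512 derivation for a password of pw_length "a"s."""
    password = b"a" * pw_length
    salt = b"fixed-demo-salt"   # illustrative only; use a random salt in practice
    start = time.perf_counter()
    key = hashlib.pbkdf2_hmac("sha512", password, salt, iterations)
    return time.perf_counter() - start, key


for n in (1, 128, 512, 4096):
    elapsed, key = time_pbkdf2(n)
    print(n, len(key), round(elapsed, 4))
```

The absolute numbers will depend on your hardware and on the underlying PBKDF2 implementation, so treat a single run as a rough indication rather than a benchmark.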

These are the sorts of details that a cryptographer or cryptography expert would pay attention to, as opposed to an end-developer.

It's worth pointing out that PBKDF2-HMAC-SHA512 is the default password hashing function for Mac OS X, with a variable cost between 30,000 and 50,000 iterations (typical PBKDF2 default is 1,000).

OpenBSD, USENIX, and bcrypt

Because Ulrich Drepper brought up bcrypt, it's worth mentioning in this post. First off, let's get something straight: bcrypt IS NOT Blowfish. While it's true that bcrypt is based on Blowfish, they are two completely different cryptographic primitives. bcrypt is a one-way cryptographic password hashing function, whereas Blowfish is a two-way 64-bit block symmetric cipher.

At the 1999 USENIX conference, Niels Provos and David Mazières, of OpenBSD, introduced bcrypt to the world (it was actually in OpenBSD 2.1, June 1, 1997). They were critical of md5crypt, stating the following (emphasis mine):

MD5 crypt hashes the password and salt in a number of different combinations to slow down the evaluation speed. Some steps in the algorithm make it doubtful that the scheme was designed from a cryptographic point of view--for instance, the binary representation of the password length at some point determines which data is hashed, for every zero bit the first byte of the password and for every set bit the first byte of a previous hash computation.

PHK was slightly offended by their off-handed remark that cryptography was not his core consideration when designing md5crypt. However, Niels Provos was a graduate student in the Computer Science PhD program at the University of Michigan at the time. By August 2003, he had earned his PhD. Since 1997, bcrypt has withstood the test of time, it has been considered "Best Practice" for hashing passwords, and is still well received today, even though better algorithms exist for hashing passwords.

bcrypt limits password input to 72 bytes. One way around the password limit is with pre-hashing. A common approach in pseudocode is to hash the password with SHA-256, encode the digest into base64, then feed the resulting ASCII string into bcrypt. However, make sure to salt the prehash, or you fall victim to breach correlation attacks. Using HMAC is a better option than generic cryptographic hashes, as it has a construction for properly handling secret keys. In this case, a site-wide secret known as a "pepper" is appropriate.

In pseudocode:

pwhash = bcrypt(base64(hmac-sha-256(password, pepper, 256)), salt, cost)

This results in a 44-byte password (including the "=" padding) that is within the bounds of the 72 byte bcrypt limitation. This prehashing allows users to have any length password, while only ever sending 44 bytes to bcrypt. My implementation in this benchmark uses the passlib.hash.bcrypt_sha256.hash() method. How does bcrypt compare to md5crypt, sha256crypt, and sha512crypt in execution time based on password length?
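To make the 44-byte claim concrete, here is a small standard-library sketch of just the prehash step. It is not the passlib implementation; the function name "prehash" and the pepper value are hypothetical, and the bcrypt call itself is left out.

```python
import base64
import hashlib
import hmac


def prehash(password, pepper):
    """HMAC-SHA-256 the password with a site-wide pepper, then base64 it."""
    digest = hmac.new(pepper, password, hashlib.sha256).digest()  # 32 bytes
    return base64.b64encode(digest)                               # 44 ASCII bytes


pepper = b"hypothetical-site-wide-secret"
short = prehash(b"a", pepper)
long_ = prehash(b"a" * 4096, pepper)
print(len(short), len(long_))  # 44 44
# The 44-byte result would then be handed to bcrypt along with a salt and cost.
```

Whether the input is 1 byte or 4,096 bytes, bcrypt only ever sees 44 bytes of ASCII.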

bcrypt 1-128 characters (prehashed)

bcrypt 1-512 characters (prehashed)

bcrypt 1-4,096 characters (prehashed)

Now, to be fair, bcrypt is only ever hashing 44 byte passwords in the above results, because of my prehashing. So of course it's running in constant time. So, how does it look with hashing 1 to 72 character passwords without prehashing?

bcrypt 1-72 characters (raw)

Again, we see consistent execution, driven entirely by iteration cost, not by password length.

Colin Percival, Tarsnap, and scrypt

In May 2009, mathematician Dr. Colin Percival presented to BSDCan'09 about a new adaptive password hashing function called scrypt, that was not only CPU expensive, but RAM expensive as well. The motivation was that even though bcrypt and PBKDF2 are CPU-intensive, FPGAs or ASICs could be built to work through the password hashes much more quickly, due to not requiring much RAM, around 4 KB. By adding a memory cost, in addition to a CPU cost to the password hashing function, we can now require the FPGA and ASIC designers to onboard a specific amount of RAM, thus financially increasing the cost of production. scrypt recommends a default RAM cost of at least 16 MB. I like to think of these expensive functions as "security by obesity".
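As a rough illustration of where a figure like 16 MB comes from, Python's standard library exposes scrypt through "hashlib.scrypt" (available when Python is built against a recent enough OpenSSL). The parameters below are assumptions chosen for the example, not a recommendation from this post: with N = 2^14 and r = 8, scrypt's main working array is 128 * N * r bytes.

```python
import hashlib
import os

N = 2**14   # CPU/memory cost parameter
R = 8       # block size parameter
P = 1       # parallelization parameter

# scrypt's large working array takes 128 * N * R bytes.
print(128 * N * R // (1024 * 1024), "MiB")  # 16 MiB

salt = os.urandom(16)
key = hashlib.scrypt(b"correct horse battery staple", salt=salt,
                     n=N, r=R, p=P, maxmem=64 * 1024 * 1024, dklen=32)
print(len(key))  # 32
```

Doubling N doubles both the memory and the time required, which is the lever that forces custom hardware to carry real RAM.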

scrypt was initially created as an expensive KDF for his backup service Tarsnap. Tarsnap generates client-side encryption keys, and encrypts your data on the client, before shipping the encrypted payload off to Tarsnap's servers. If at any event your client is lost or stolen, generating the encryption keys requires knowing the password that created them, and attempting to discover that password, just like typical password hashing functions, should be slow.

It's now been 9 years as of this post, since Dr. Percival introduced scrypt to the world, and like bcrypt, it has withstood the test of time. It has received, and continues to receive extensive cryptanalysis, is not showing any critical flaws or weaknesses, and as such is among the top choices as a recommendation from security professionals for password hashing and key derivation.

How does it fare with its execution time per password length?

scrypt 1-128 characters

scrypt 1-512 characters

scrypt 1-4,096 characters

I'm seeing a trend here.

The Password Hashing Competition winner Argon2

In 2013, an open public competition, in the spirit of AES and SHA-3, was held to create a password hashing function that approached password security from what we knew with modern cryptography and password security. There were many interesting designs submitted, including a favorite of mine by Dr. Thomas Pornin of StackExchange fame and BearSSL, that used delegation to reduce the work load on the honest, while still making it expensive for the password cracker.

In July 2015, the Argon2 algorithm was chosen as the winner of the competition. It comes with a clean approach of CPU and memory hardness, making the parameters easy to tweak, test, and benchmark. Even though the algorithm is relatively new, it has seen at least 5 years of analysis, as of this writing, and has quickly become the "Gold Standard" for password hashing. I fully recommend it for production use.

Any bets on how its execution times will be affected by password length? Let's look:

Argon2 1-128 characters

Argon2 1-512 characters

Argon2 1-4,096 characters

Execution time is not affected by password length. Imagine that. It's as if cryptographers know what they're doing when designing this stuff.

Conclusion

Ulrich Drepper tried creating something more secure than md5crypt, on par with bcrypt, and ended up creating something worse. Don't use sha256crypt or sha512crypt; they're dangerous.

For hashing passwords, in order of preference, use with an appropriate cost:

Argon2 or scrypt (CPU and RAM hard)

bcrypt or PBKDF2 (CPU hard only)

Avoid practically everything else:

md5crypt, sha256crypt, and sha512crypt

Any generic cryptographic hashing function (MD5, SHA-1, SHA-2, SHA-3, BLAKE2, etc.)

Any complex homebrew iterative design (10,000 iterations of salted SHA-256, etc.)

Any encryption design (AES, Blowfish (ugh), ChaCha20, etc.)

UPDATE: 2020-12-28: Debian just pushed Linux PAM 1.4.0 into the unstable repository. This enables bcrypt password hashing for Debian and Debian-based systems by default without any 3rd party tools or custom source code compilation. It is strongly advised that you drop sha256crypt/sha512crypt in favor of bcrypt.

UPDATE: A note about PBKDF2 that was brought up in a Twitter thread from @solardiz. PBKDF2-HMAC-SHA512 isn't really an upgrade from sha512crypt (nor PBKDF2-HMAC-SHA256 an upgrade from sha256crypt), because PBKDF2 really isn't GPU resistant in the way bcrypt is. However, bcrypt can be implemented cheaply on ASICs with only 4 KB of memory.

If your choice of password hashing is constrained to NIST standards, which include PBKDF2, then unfortunately, bcrypt, scrypt, and Argon2 are out of the question; just make sure to use PBKDF2 properly, which includes choosing a high iteration count based on your authentication load capacity. At that point, password storage is probably not the worst of your security concerns.

However, if you're not limited to NIST constraints, then use the others.

Acknowledgement

Thanks to Steve Thomas (@Sc00bzT) for our discussions on Twitter for helping me see this quirky behavior with sha256crypt and sha512crypt.

Use A Good Password Generator

Aaron Toponce — Thu, 19 Apr 2018 14:26:39 +0000

Introduction

For the past several months now, I have been auditing password generators for the web browser, tracking the results in Google Sheets. It started by looking for creative ideas I could borrow or extend upon for my online password generator. Sure enough, I found some, ranging from using mouse movements as a source of entropy to flashy animations of rolling dice for a Diceware generator. Some were deterministic, using strong password hashing or key derivation functions, and some had very complex interfaces, allowing you to control everything from letter case to pronounceable passwords and unambiguity.

However, the deeper I got, the more I realized some were doing their generation securely and others weren't. I soon realized that I wanted to grade these online generators and sift out the good from the bad. So, I created a spreadsheet to keep track of what I was auditing, and it quickly grew from "online generators" to "password generators and passphrase generators", to "web password, web passphrase, bookmarklet, Chrome extensions, and Firefox extensions".

When all was said and done, I had audited 300 total generators that can be used with the browser. Some were great while some were just downright horrible. So, what did I audit, why did I choose that criteria, and how did the generators fall out?

I audited:

Software license

Server vs. client generation

RNG security, bias, and entropy

Generation type

Network security

Mobile support

Ads or tracker scripts

Subresource integrity

No doubt this is a "version 1.0" of the spreadsheet. I'm sure those in the security community will mock me for my choices of audit categories and scoring. However, I wanted to be informed of how each generator was generating the passwords, so when I made recommendations about using a password generator, I had confidence that I was making a good recommendation.

Use A Password Manager

Before I go any further, the most important advice with passwords, is to use a password manager. There are a number of reasons for this:

They encourage unique passwords for each account.

They encourage passwords with sufficient entropy to withstand offline clustered attacks.

They allow storage of other types of data, such as SSNs or credit card numbers.

Many provide online synchronization support across devices, either internally or via Dropbox and Google Drive.

Many ship additional application support, such as browser extensions.

They ship password generators.

So before you go any further, the Best Practice for passwords is "Use A Password Manager". As members of the security community, this is the advice we should be giving to our clients, whether they are family, friends, coworkers, or professional customers. But, if they are already using a password manager, and discussions arise about password generation, then this audit is to inform members of the security community which generators are "great", which generators are "good", which generators are "okay", and finally, which generators are "bad".

So to be clear, I don't expect mom, dad, and Aunt Josephine to read this spreadsheet so they can make informed decisions about which password generator to use. I do hope, however, that security researchers, cryptographers, auditors, security contractors, and other members of the security community will take advantage of it.

So with that said, let me explain the audit categories and scoring.

Software License

In an ethical software development community, there is value granted when software is licensed under a permissive "copyleft" license. Not necessarily GPL, but any permissive license, from the 2-clause BSD to the GPL, from the Creative Commons to unlicensed public domain software. When the software is licensed liberally, it allows developers to extend, modify, and share the code with the larger community. There are a number of different generators I found in my audit where this happened; some making great changes in the behavior of the generator, others not-so-much.

License Open Source Proprietary

So when a liberal license was explicitly specified, I awarded one point for being "Open Source", and no points for being "Proprietary" when a license was either not explicitly specified or was a EULA or "All Rights Reserved".

This is something to note: just because the code is on a public source code repository does not mean the code is licensed under a permissive license. United States copyright law states that unless explicitly stated, all works fall under a proprietary "All Rights Reserved" copyright to the creator. So if you didn't specify a license in your GitHub repository, it got labeled as "Proprietary". It's unfortunate, because I think a lot of generators missed getting awarded that point for a simple oversight.

Server vs. Client Generation

Every generator should run in the browser client without any knowledge of the generation process by a different computer, even the web server. No one should have any knowledge whatsoever of what passwords were generated in the browser. Now, I recognize that this is a bit tricky. When you visit a password generator website such as my own, you are showing a level of trust that the JavaScript delivered to your browser is what you expect, and is not logging the generated passwords back to the server. Even with TLS, unless you're checking the source code on every page refresh and disconnecting your network, you just cannot guarantee that the web server did not deliver some sort of logging script.

Generator Client Server

With that said, you still should be able to check the source code on each page refresh, and check if it's being generated in the client or on the server. I awarded one point for being a "Client" generator and no points for being a "Server" generator. Interestingly enough, I thought I would just deal with this for the website generators, and wouldn't have to worry about this with bookmarklets or browser extensions. But I was wrong. I'll expand on this more in the "Network Security" category, but suffice it to say, this is still a problem.

Generation Type

I think deterministic password generators are fundamentally flawed. Fatally so, even. Tony Arcieri wrote a beautiful blog post on this matter, and it should be internalized across the security community. The "four fatal flaws" of deterministic password generators are:

Deterministic password generators cannot accommodate varying password policies without keeping state.

Deterministic password generators cannot handle revocation of exposed passwords without keeping state.

Deterministic password managers can't store existing secrets.

Exposure of the master password alone exposes all of your site passwords.

Number 4 in that list is the most fatal. We all know the advice that accounts should have unrelated randomized unique passwords. When one account is compromised, the exposed password does not compromise any of the other accounts. This is not the case with deterministic password generators. Every account that uses a password from a deterministic generator shares a common thread via the master secret. When that master secret is exposed, all online accounts remain fatally vulnerable to compromise.

Proponents of deterministic generators will argue that discovery of the master password of an encrypted password manager database will also expose every online account to compromise. They're not wrong, but let's consider the attack scenarios. In the password manager scenario, a first compromise needs to happen in order to get access to the encrypted database file. Then a second compromise needs to happen in discovering the master password. But with deterministic generators, only one compromise needs to take place: using dictionary or brute force attacks to discover the master password that led to the password for the online account.

With password managers, two compromises are necessary. With deterministic generators, only one compromise is necessary. As such, the generator was awarded a point for being "Random" and no points for being "Deterministic".

Generator Random Unknown Deterministic

RNG Security

Getting random number generation right is one of the least understood concepts in software development, but ironically, developers seem to think they have a firm grasp of the basic principles. When generating passwords, never at any point should a developer choose anything but a cryptographic random number generator. I awarded one point for using a CRNG, and no points otherwise. In some cases, the generation is done on the server, so I don't know or can't verify its security, and in some cases, the code is so difficult to analyze that I cannot determine its security that way either.

CRNG Yes Maybe Unknown No

In JavaScript, using a CRNG primarily means using the Web Crypto API via "window.crypto.getRandomValues()", or "window.msCrypto.getRandomValues()" for Microsoft-based browsers. Never should I see "Math.random()". Even though it may be cryptographically secure in Opera, it likely isn't in any of the other browsers. Some developers shipped the Stanford JavaScript Cryptographic Library. Others shipped a JavaScript implementation of ISAAC, and others yet shipped some AES-based CSPRNG. While these are "fine", you really should consider ditching those userspace scripts in favor of just calling "window.crypto.getRandomValues()". It's less software the user has to download, and as a developer, you are less likely to introduce a vulnerability.

Also, RC4 is not a CSPRNG, neither is ChaCha20, SHA-256, or any other hash function, stream cipher, or block cipher. So if you were using some JavaScript library that is using a vanilla hashing function, stream cipher, or block cipher as your RNG, then I did not consider it as secure generation. The reason being, is that even though ChaCha20 or SHA-256 may be prediction resistant, it is not backtracking resistant. To be a CRNG, the generator must be both prediction and backtracking resistant.

However, in deterministic password generators that are based on a master password, the "RNG" (using this term loosely here) should be a dedicated password hashing function or password-based key derivation function with an appropriate cost. This really means using only:

sha256crypt or sha512crypt with at least 5,000 rounds.

PBKDF2 with at least 1,000 rounds.

bcrypt with a cost of at least 5.

scrypt with a cost of at least 16 MiB of RAM and at least 0.5s execution.

Argon2 with sufficient cost of RAM and execution time.

Anything else, like hashing the master password or mouse entropy with MD5, SHA-1, SHA-2, or even SHA-3 will not suffice. The goal of those dedicated password hashing or key derivation functions is to slow down an offline attempt at discovering the password. Likely the master password does not contain sufficient entropy, so it remains the weakest link in the generation process. By using a dedicated password hashing or key derivation function with an appropriate cost, we can guarantee a certain "speed limit" with offline clustered password attacks, making it difficult to reconstruct the password.
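To make the "speed limit" idea concrete, here is a minimal sketch using PBKDF2 from Python's standard library. The function name, the choice of the site name as the salt, and the round count are my own assumptions for illustration; they are not taken from any generator in the audit.

```python
import hashlib

def derive_site_secret(master_password, site, rounds=100000):
    """Stretch a master password into 32 bytes of site-specific key material.

    Hypothetical helper: using the site name as the salt and 100,000 rounds
    are illustrative assumptions, not a recommendation from the audit.
    """
    return hashlib.pbkdf2_hmac(
        "sha256",                        # underlying HMAC hash
        master_password.encode("utf-8"),
        site.encode("utf-8"),            # salt
        rounds,                          # the cost: every guess pays this price
        dklen=32,
    )

key = derive_site_secret("correct horse battery staple", "example.com")
print(key.hex())
```

Every additional round is paid again for each guess an attacker makes, which is the whole point of choosing a function with a tunable cost over a bare hash.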

RNG Uniformity

Even though the developer may have chosen a CRNG, they may not be using the generator uniformly. This is probably the most difficult aspect of random number generation to grasp. It seems harmless enough to call "Math.floor(Math.random() * length)" or "window.crypto.getRandomValues(new Uint32Array(1))[0] % length". In both cases, unless "length" is a power of 2, the generator is biased. I awarded one point for being an unbiased generator, and zero points otherwise.

Uniform Yes Maybe Unknown No

To do random number generation in an unbiased manner, you need to find how many times the length divides the period of the generator, and note the remainder. For example, if using "window.crypto.getRandomValues(new Uint32Array(1))", then the generator has a period of 32 bits. If your length is "7,776", as in the case of Diceware, then 7,776 divides 2³² 552,336 times with a remainder of 2,560. This means that 7,776 divides values 0 through 2³²-2,561 evenly. So if your random number is in the range of 2³²-2,560 through 2³²-1, the value needs to be tossed out, and a new number generated.
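The toss-out-and-retry approach described above can be sketched in Python as follows. This is an illustration under my own naming (the helpers "rejection_limit" and "uniform_below" are hypothetical); in practice Python's "secrets.randbelow()" already provides an unbiased result.

```python
import secrets

def rejection_limit(n, bits=32):
    """Largest multiple of n that fits in a `bits`-wide generator output."""
    space = 1 << bits
    return space - (space % n)

def uniform_below(n, bits=32):
    """Return an unbiased integer in [0, n) by rejection sampling."""
    if not 0 < n <= (1 << bits):
        raise ValueError("n must be between 1 and 2**bits")
    limit = rejection_limit(n, bits)
    while True:
        r = secrets.randbits(bits)   # CSPRNG-backed draw of `bits` bits
        if r < limit:                # keep only the evenly divisible range
            return r % n
        # otherwise the value is in the biased tail: toss it and draw again

# Diceware: 7,776 words, 32-bit draws
print((1 << 32) // 7776)                 # 552336
print((1 << 32) % 7776)                  # 2560
print((1 << 32) - rejection_limit(7776)) # 2560 values are rejected
print(uniform_below(7776))
```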

Oddly enough, you could use an insecure RNG, such as SHA-256, but truncate the digest to a certain length. While the generator is not secure, the generator in this case is unbiased. More often than not actually, deterministic generators seem to fall in this category, where a poor hashing function was chosen, but the digest was truncated.

RNG Entropy

I've blogged about this a number of times, so I won't repeat myself here. Search my blog for password entropy, and get caught up with the details. I awarded one point for generators with at least 70 bits of entropy, 0.5 points for 55 through 69 bits of entropy, and no points for entropy less than 55 bits.

Entropy 70 69 55 54
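As a rough illustration of how those thresholds play out, here is a small sketch. It assumes each character or word is chosen uniformly and independently, so the entropy is simply length × log2(alphabet size); the function names are hypothetical, and treating anything from 55 up to (but not including) 70 bits as the 0.5 tier is my reading of the ranges above.

```python
import math

def password_entropy_bits(length, alphabet_size):
    """Entropy in bits of `length` symbols drawn uniformly from the alphabet."""
    return length * math.log2(alphabet_size)

def entropy_points(bits):
    """Map entropy to the audit's points: 1 for 70+, 0.5 for 55 to <70, else 0."""
    if bits >= 70:
        return 1.0
    if bits >= 55:
        return 0.5
    return 0.0

examples = [
    ("8 printable ASCII characters", 8, 94),
    ("9 printable ASCII characters", 9, 94),
    ("6 Diceware words", 6, 7776),
]
for label, length, size in examples:
    bits = password_entropy_bits(length, size)
    print("{}: {:.1f} bits, {} points".format(label, bits, entropy_points(bits)))
```

The 8 character default that so many people gravitate toward lands just over 52 bits, which is why it earns nothing here.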

I will mention, however, that I chose the default value that was presented to me when generating the password. Worse, if I was forced to choose my length, and I could choose a password of one character, then I awarded it as such. When you're presenting password generators to people like Aunt Josephine, they'll do anything they can to get away with as little as possible. History has shown this is 6-8 characters in length. This shouldn't be possible. A few Nvidia GTX960 GPUs can crack every 8 character ASCII password hashed with SHA-1 in under a week. There is no reason why the password generator should not present minimum defaults that are outside the scope of practical hobbyist brute force searching.

So while that may be harsh, if you developed one of the generators in my audit, and you were dinged for this- I'm calling you out. Stop it.

Network Security

When delivering the software for the password generation, it's critical that the software is delivered over TLS. There should be no room for a man-in-the-middle to inject malicious code to discover what passwords you're generating, send you a predetermined list of passwords, or any other sort of compromise. This means, however, that I expect a green lock in the browser's address or status bars. The certificate should not be self-signed, it should not be expired, it should not be missing intermediate certificates, it should not be generated for a different CN, or have any other certificate problems. Further, the site should be HTTPS by default.

HTTPS Yes Not Default Expired No

I awarded one point for "Yes", serving the software over secure and problem-free HTTPS, and zero points otherwise.

Mobile View Support

For the most part, due to their ubiquity, developers are designing websites that support mobile devices with smaller screens and touch interfaces. It's as simple as adding a viewport in the HTML header, and as complex as customized CSS and JavaScript rules for screen widths, user agents, and other tests. Ultimately, when I'm recommending a password generator to Aunt Josephine while she's using her mobile phone, she shouldn't have to pinch-zoom, scroll, and other nuisances when generating and copying the password. As silly as that may sound, if the site doesn't support a mobile interface, then it wasn't awarded a point.

Mobile Yes No

Ads and Tracker Scripts

I get it. I know that as a developer, you want easy access to analytics about who is visiting your site. Having that data can help you learn what is driving traffic to your site, and how you can make adjustments as necessary to improve that traffic. But when it comes to password generators, no one other than me and the server that sent the code to my browser, should know that I'm about to generate passwords. I awarded a point for not shipping trackers, and zero points if the generator did.

Trackers No Yes

Google Analytics, social media scripts, ads, and other 3rd party browser scripts track me via fingerprinting, cookies, and other methods to learn more about who I am and what I visited. If you want to see how extensive this is, install the Lightbeam extension for Firefox. This shows the capability of companies to share data and learn more about who you are and where you've been on the web. Recently, Cambridge Analytica, a small unknown company, was able to mine data on millions of Facebook users, and the data mining exposed just how easy it was for Facebook to track your web behavior even when not on the site.

At first, I thought this would be just something with website generators, but when I started auditing browser extensions, I quickly saw that developers were shipping Google Analytics, and other tracking scripts in the bundled extension as well.

Offline

When I started auditing the bookmarklets and extensions, I pulled up my developer console, and watched for any network activity when generating the passwords or using the software. To my dismay, some do "call home" by either wrapping the generator around an <iframe> or requiring an account, such as in the case with password managers. I awarded one point for staying completely offline, zero points for any sort of network activity.

Offline Yes No

Now, you may be thinking that this isn't exactly fair for password generators or just <iframe>s, but the generation is still happening client-side and without trackers. For the most part, I agree, except, when installing browser extensions, the developer has the opportunity to make the password generation fully offline. That may sound a touch silly for a browser extension, but regardless, it removes the risk of a web server administrator modifying the delivered JavaScript on every page load. The only time the JavaScript would be changed is when the extension is updated. This behaves more like standard desktop software.

Subresource Integrity

Finally, the last category I audited was subresource integrity. SRI is this idea that a developer can add cryptographic integrity to <script> and <link> resources. In the words of the W3C specification:
"Compromise of a third-party service should not automatically mean compromise of every site which includes its scripts. Content authors will have a mechanism by which they can specify expectations for content they load, meaning for example that they could load a specific script, and not any script that happens to have a particular URL."

So SRI guarantees that even though the data is delivered under TLS, if the cryptographic hashes are valid, then the data has not been compromised, even if the server has. More information about SRI can be found at the W3C Github page, and I encourage everyone to check out the links there.
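For the curious, the value that goes into an "integrity" attribute is just the hash algorithm name, a dash, and the base64-encoded digest of the file. A minimal sketch of computing one in Python (the "sri_hash" helper and the sample script content are hypothetical):

```python
import base64
import hashlib

def sri_hash(content, algorithm="sha384"):
    """Build an SRI integrity value such as 'sha384-<base64 digest>'."""
    if algorithm not in ("sha256", "sha384", "sha512"):
        raise ValueError("SRI uses sha256, sha384, or sha512")
    digest = hashlib.new(algorithm, content).digest()
    return "{}-{}".format(algorithm, base64.b64encode(digest).decode("ascii"))

# Illustrative stand-in for the bytes of a script served to the browser.
script = b"console.log('hello');\n"
print(sri_hash(script))
```

The browser recomputes the digest over the bytes it actually received and refuses to run the resource if the two values differ.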

I awarded one point if "Yes" or "N/A" (no resources called), and zero points for "No".

SRI Yes N/A No

Scoring

With all these audit categories taken into account, I gave a total score to see how generators ranked among others. Each generator has different auditing criteria, so the score varies from generator type to generator type. I treated the scoring like grade school- 100% is an "A", 99% to roughly 85% as "great", 84% to 51% is "okay", and 50% or less is failing. I translated that grade school score into the following:

Score Perfect Perfect - 1 Perfect - 2 51% 50%

When you look at the spreadsheet, you'll see that it is sorted first by score in descending order then alphabetically by "Name" in ascending order. It's sorted this way, so as a security consultant, you can quickly get a high-level overview of the "good" versus the "bad", and you should be able to find the generator you're looking for quickly. The names are just generic unique identifiers, and sometimes there are notes accompanying each generator when I found odd behavior, interesting generation templates, or other things that stood out.

Conclusion

This audit is for the security community, not for Aunt Josephine and friends or family. I don't expect you to share this with your dad or your spouse or your Facebook friends, and expect them to know what they're looking at, or why they should care. However, I recently took a poll on Twitter, asking the security community if they recommended password generators to family and friends. Only 90 votes came in, but 69% of you said that you did, with a few comments coming in saying that they would only recommend them if they shipped with a password manager (see the beginning of this post). I suspect that some of the 31% of the "No" votes are in this category.

So, of those 69% of you that said "Yes", I hope this audit will be of some value.

The Entropy of a Digital Camera CCD/CMOS Sensor

Aaron Toponce — Fri, 22 Dec 2017 08:21:00 +0000

Recently, Vault12 released an app for iOS that uses the mobile device's camera as a source of randomness. Unfortunately, when putting the generated binary files through the Dieharder tests, it comes out pretty bad. I get 20 "PASSED", 13 "WEAK", and 81 "FAILED" results. For a TRNG, it should be doing much better than that. Now, to be clear, I'm not writing this post to shame Vault12. I actually really love the TrueEntropy app, and am eagerly waiting for it to hit Android, so I can carry around a TRNG in my pocket. However, things can get better, and that is what this post is hopefully addressing.

Using a camera as a TRNG is nothing new. SGI created a patent for pointing a webcam at a lava lamp, using the chaotic nature of the lava lamp itself as the source of entropy. Later, it was realized that this was unnecessary. The CCD/CMOS in the camera was experiencing enough noise from external events to be more than sufficient. This noise is shown in the photo itself, and is appropriately referred to as "image noise".

The primary sources of noise come from the following:

Thermal noise- Caused by temperature fluctuations due to electrons flowing across resistant mediums.

Photon noise- Caused by photons hitting the CCD/CMOS and releasing energy to neighboring electronics.

Shot noise- Caused by current flow across diodes and bipolar transistors.

Flicker noise- Caused by traps due to crystal defects and contaminants in the CCD/CMOS chip.

Radiation noise- Caused by alpha, beta, gamma, x-ray, and proton decay from radioactive sources (such as outer-space) interacting with the CCD/CMOS.

Some of these noise sources can be manipulated. For example, by cooling the camera, we can limit thermal noise. A camera at 0 degrees Celsius will experience less noise than one at 30 degrees Celsius. A camera in a dark room with fewer photons hitting the sensor will experience less noise than one in a bright room. Radiation noise can be limited by isolating the sensor in a radiation-protective barrier.

Let's put this to the test, and see if we can actually calculate the noise in a webcam. To do this, we'll look at a single frame with the lens cap on, where the photo is taken in a dark room, and the webcam is further enclosed in a box. We'll take the photo at about 20 degrees Celsius (cool room temperature).

In order to get a basis for the noise in the frame, we'll use Shannon Entropy from information theory. Thankfully, understanding Shannon Entropy isn't that big of a deal. My frame will be taken with OpenCV from a PlayStation 3 Eye webcam, which means the frame itself is just a big multidimensional array of numbers between 0 and 255 (each pixel only provides 8 bits of color depth). So, to calculate the Shannon Entropy of a frame, we'll do the following:

Place each number in its own unique bin of 0 through 255.

Create an observed probability distribution (histogram) by counting the numbers in each bin.

Normalize the distribution, creating 256 p-values (the sum of which should equal "1").

For each of the 256 bins, calculate: -p_i*log_2(p_i).

Sum the 256 values.

Thankfully, I don't have to do all of this by hand- numpy provides a function for me to call that does a lot of the heavy lifting for me.

So, without further ado, let's look at the code, then run it:

#!/usr/bin/python

import cv2
import math
import numpy

def max_brightness(frame):
    # Work in HSV so the brightness (V) channel can be changed on its own.
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    # Push every non-zero brightness value to the maximum.
    v[v > 0] = 255
    # As written, this also raises the remaining zero-valued pixels to 255.
    v[v <= 0] += 255
    final_hsv = cv2.merge((h, s, v))
    frame = cv2.cvtColor(final_hsv, cv2.COLOR_HSV2BGR)
    return frame

def get_entropy(frame):
    # Shannon entropy of the frame's values, using a 256-bin histogram.
    histogram = numpy.histogram(frame, bins=256)[0]
    histogram_length = sum(histogram)
    samples_probability = [float(h) / histogram_length for h in histogram]
    entropy = -sum([p * math.log(p, 2) for p in samples_probability if p != 0])
    return entropy

# Open the first webcam and request a 640x480 frame size.
cap = cv2.VideoCapture(0)
cap.set(3, 640)  # property 3 is the frame width
cap.set(4, 480)  # property 4 is the frame height

# Grab two frames back to back and take their absolute difference.
ret, frame1 = cap.read()
ret, frame2 = cap.read()
frame_diff = cv2.absdiff(frame1, frame2)

cv2.imwrite('/tmp/frame1.bmp', frame1)
cv2.imwrite('/tmp/frame2.bmp', frame2)
cv2.imwrite('/tmp/frame_diff.bmp', frame_diff)

# Brightness-adjusted copies, used to make the noise visible.
frame1_max = max_brightness(frame1)
frame2_max = max_brightness(frame2)
frame_diff_max = max_brightness(frame_diff)

cv2.imwrite('/tmp/frame1_max.bmp', frame1_max)
cv2.imwrite('/tmp/frame2_max.bmp', frame2_max)
cv2.imwrite('/tmp/frame_diff_max.bmp', frame_diff_max)

print("Entropies:")
print(" frame1: {}".format(get_entropy(frame1)))
print(" frame2: {}".format(get_entropy(frame2)))
print(" frame_diff: {}".format(get_entropy(frame_diff)))
print(" frame1_max: {}".format(get_entropy(frame1_max)))
print(" frame2_max: {}".format(get_entropy(frame2_max)))
print(" frame_diff_max: {}".format(get_entropy(frame_diff_max)))

Let's look over the code before running it. First, I'm actually capturing two frames right next to each other, then taking their composite difference. We know that a photo consists of its signal (the data most people are generally interested in) and its noise (the data they're not). By taking the composite difference between the two, I'm attempting to remove the signal. Because the frames were taken in rapid succession, provided nothing was drastically changing between the frames, most of the data will be nearly identical. So the signal should disappear.

But what about the noise? Well, as discussed above, the noise is a bit unpredictable and slightly unmanageable. Unlike my signal, the noise will be drastically different between the two frames. So, rather than removing noise, I'll actually be adding noise in the difference.

The next thing you'll notice is that I'm either maximizing or completely removing the luminosity in an HSV color profile. This is done just as a visual demonstration of what the noise actually "looks like" in the frame. You can see this below (converted to PNG for space efficiency).

Frame 1

Frame 2

Difference of frames 1 & 2

Frame 1 maxed luminosity

Frame 2 maxed luminosity

Difference of frames 1 & 2 maxed luminosity

Running the Python script in my 20 degrees Celsius dark room with the lens cap on and all light removed as much as possible, I get:

$ python frame-entropy.py
Entropies:
 frame1: 0.0463253223509
 frame2: 0.0525489364555
 frame_diff: 0.0864590940377
 frame1_max: 0.0755975713815
 frame2_max: 0.0835428883103
 frame_diff_max: 0.134900632254

The "ent(1)" userspace utility confirms these findings when saving the frames as uncompressed bitmaps:

$ for I in frame*.bmp; do printf "$I: "; ent "$I" | grep '^Entropy'; done
frame1.bmp: Entropy = 0.046587 bits per byte.
frame1_max.bmp: Entropy = 0.076189 bits per byte.
frame2.bmp: Entropy = 0.052807 bits per byte.
frame2_max.bmp: Entropy = 0.084126 bits per byte.
frame_diff.bmp: Entropy = 0.086713 bits per byte.
frame_diff_max.bmp: Entropy = 0.135439 bits per byte.

It's always good to use an independent source to confirm your findings.

So, in the standard frames, I'm getting about 0.05 bits per byte of entropy. However, when taking the composite difference, that number almost doubles to about 0.09 bits per byte. This was expected, as you remember, we're essentially taking the noise from both frames, and composing them in the final frame. Thus, the noise is added in the final frame.

What was actually surprising to me were the entropy values after setting the extreme luminosity values. This may be due to the fact that there are larger deltas between adjacent pixels when creating our histogram. When taking the difference of the two adjusted frames, the entropy jumps up to about 0.13 bits per byte. So, we could safely say that a composed frame with maxed luminosity that is the difference of two frames has about 0.11 bits of entropy per byte, plus or minus 0.02 bits per byte.

What does this say about the frame as a whole though? In my case, my frame is 640x480 pixels. Knowing that each pixel in my PS3 Eye webcam only occupies 1 byte or 8 bits, we can calculate the entropy per frame:

(640*480) pixels/frame * 1 byte/pixel = 307200 bytes/frame
307200 bytes/frame * 0.11 entropy bits/byte = 33792 entropy bits/frame

Each frame from my 640x480 PS3 Eye webcam provides about 33,792 bits of entropy. For comparison, SHA-256 theoretically provides a maximum of 256 bits of entropic security. Of course, we should run millions of trials, collect the data, calculate the standard deviation, and determine a more accurate average entropy. But this will suffice for this post.

So, now that we know this, what can we do with this data? Well, we can use it as a true random number generator, but we have to be careful. Unfortunately, as the frame is by itself, it's heavily biased. In the frame, there exists spatial correlation with adjacent pixels. In the frame difference, there exists both spatial and time correlations. This isn't sufficient as a secure true random number generator. As such, we need to remove the bias. There are a few ways of doing this, called "randomness extraction", "software whitening", "decorrelation", or "debiasing". Basically, we want to take a biased input, and remove any trace of bias in the output.

We could use John von Neumann decorrelation, where we look at two non-overlapping consecutive bits. If the two bits are identical, then both bits are discarded. If they are different, then the most significant bit is output, while the least significant bit is discarded. This means that at best, you are discarding three quarters of your data (even with perfectly unbiased input, only half of the pairs differ, and each of those yields one bit out of two), but how much is discarded all depends on how badly biased the data is. We know that our frame is only providing 0.11 bits of entropy per 8 bits. So we're keeping 11 bits out of 800. That's a lot of data that is discarded. One drawback with this approach, however, is if one or more bits are "stuck", such as in a dead pixel. Of course, this will lower the overall entropy of the frame, but will also drastically impact the extractor.
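A minimal sketch of the von Neumann extractor, operating on a plain list of 0s and 1s (the function name and the list-of-bits input are my own choices for illustration):

```python
def von_neumann_extract(bits):
    """Debias a sequence of 0/1 values using non-overlapping pairs.

    Identical pairs are dropped entirely; for differing pairs the
    first bit is kept and the second is discarded.
    """
    output = []
    for i in range(0, len(bits) - 1, 2):
        first, second = bits[i], bits[i + 1]
        if first != second:
            output.append(first)
    return output

print(von_neumann_extract([0, 0, 0, 1, 1, 0, 1, 1]))  # [0, 1]
print(von_neumann_extract([1] * 16))                  # [] -- a "stuck" source yields nothing
```

The second call shows the stuck-bit problem: a source that never changes produces no output at all.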

A different approach would be to use chaos machines. This idea is relatively new and not thoroughly studied; at least I'm struggling to find good research papers on the topic. The idea is taking advantage of the chaotic behavior of certain dynamical systems, such as a double pendulum. Due to the nature of chaos, small fluctuations in initial conditions lead to large changes in the distant future. The trick here is selecting your variables, such as the time distance and chaotic process correctly. Unlike John von Neumann decorrelation, that automatically discovers the bias and discards it for you, care has to be taken to make sure that the resulting output is debiased.

A better approach is using cryptographic primitives like one-way hashing or encryption, sometimes called "Kaminsky debiasing". Because most modern cryptographic primitives are designed to emulate theoretical uniform unbiased random noise, the security rests on whether or not that can be achieved. In our case, we could encrypt the frame with AES and feed the ciphertext as our TRNG output. Unfortunately, this means also managing a key, which doesn't necessarily have to be kept secret. A better idea would be to use cryptographic hashing functions, like SHA-2, or even better, extendable output functions (XOFs).

Obviously, it should go without stating, that encrypting or hashing your biased input isn't increasing the entropy. This means that we need to have a good handle on what our raw entropy looks like (as we did above) beforehand. We know in our case that we're getting about 35 kilobits of entropy per frame, so hashing with SHA-256 is perfectly acceptable, even if we're losing a great deal of entropy in the output. However, if we were only getting 200-bits of security in each frame, while SHA-256 is debiasing the data, we still only have 200-bits of entropy in the generated output.

Really though, the best approach is an XOF. We want to output as much of our raw entropy as we can. Thankfully, NIST has 2 XOFs standardized as part of SHA-3: SHAKE128 and SHAKE256. An XOF allows you to output a digest of any length, where SHA-256, for example, only allows 256 bits of output. The security margin of the SHAKE128 XOF function is the minimum of half of the digest or 128 bits. If I have an entropy of 35 kilobits, I would like to have all of that entropy available in the output. As such, I can output 4 KB in the digest knowing full well that's within my entropy margin. Even though I'm losing as much data as the John von Neumann extractor, I'm not vulnerable to "stuck pixels" being a problem manipulating the extractor.
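To see why 4 KB fits, the entropy estimate can be turned into a byte budget. This sketch just replays the figures measured earlier (640x480, 1 byte per pixel, 0.11 bits per byte); the function name is hypothetical, and integer arithmetic in hundredths of a bit is used only to avoid floating point rounding.

```python
def frame_entropy_budget(width=640, height=480, bytes_per_pixel=1,
                         entropy_hundredths_per_byte=11):
    """Return (bytes per frame, entropy bits per frame, entropy bytes per frame)."""
    frame_bytes = width * height * bytes_per_pixel
    entropy_bits = frame_bytes * entropy_hundredths_per_byte // 100
    return frame_bytes, entropy_bits, entropy_bits // 8

frame_bytes, entropy_bits, entropy_bytes = frame_entropy_budget()
print(frame_bytes)            # 307200
print(entropy_bits)           # 33792
print(entropy_bytes)          # 4224
print(4096 <= entropy_bytes)  # True: a 4 KB digest stays inside the estimate
```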

When I put all this code together in Python:

Take the difference of two consecutive overlapping frames.

Maximize the luminosity of the new composed frame.

Hash the frame with SHAKE128.

Output 4 KB of data as our true random noise.
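A stripped-down sketch of those steps, working on raw frame bytes instead of OpenCV arrays so that it stays self-contained. The function name is hypothetical, the luminosity step is left out for brevity, and the random stand-in frames exist only so the example runs without a camera.

```python
import hashlib
import os

def extract_random_bytes(frame1, frame2, out_len=4096):
    """Hash the byte-wise absolute difference of two frames with SHAKE128."""
    if len(frame1) != len(frame2):
        raise ValueError("frames must be the same size")
    # Byte-wise |a - b|, the same quantity cv2.absdiff() gives for 8-bit frames.
    diff = bytes(abs(a - b) for a, b in zip(frame1, frame2))
    # SHAKE128 is an XOF, so the digest length is chosen by the caller.
    return hashlib.shake_128(diff).digest(out_len)

# Stand-in frames; a real run would use two consecutive camera captures.
frame_a = os.urandom(640 * 480)
frame_b = os.urandom(640 * 480)
noise = extract_random_bytes(frame_a, frame_b)
print(len(noise))  # 4096
```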

At 30 frames per second for a resolution of 640x480, outputting 4 KB per frame will provide 120 KBps of data, and this is exactly what I see when executing the Python script. The PS3 Eye camera also supports 60 fps at a lower resolution, so I could get 240 KBps if I can keep the same security margin of 4 KB per frame. I haven't tested this, but intuition tells me I'll have a lower security margin at the lower resolution.

Coming full circle, when we put our TRNG to the Dieharder test, things come out vastly different than Vault12's results:

Vault12 TrueEntropy:

PASSED: 20

WEAK: 13

FAILED: 81

My webcam TRNG:

PASSED: 72

WEAK: 12

FAILED: 30

1,000 Books Read In One Year? No, Not By A Long Shot

Aaron Toponce — Wed, 27 Sep 2017 06:34:31 +0000

Recently, Goodreads sent out a tweet about how to remove social media and the Internet from your life, so you can focus on reading 1,000 books in one year. The post follows this sort of math:

The average person reads 400 words per minute.

The typical non-fiction books have around 50,000 words.

Reading 200 books will take you 417 hours.

The average person spends 608 hours on social media annually.

The average person spends 1,642 hours watching TV annually.

Giving up 2,250 hours annually will allow you to read 1,000 books in one year.

This blew my mind. I'm a very avid reader. Since signing up for Goodreads in 2013, I've been hitting at least 20,000 pages read every year, and I'm on track to read 25,000 pages this year. But, I'm only putting down 75 books each year. Now granted, a spare 2,250 hours per year ÷ 365 days per year is just over 6 hours per day of reading. I'm not reading 6 hours per day. I don't watch TV, I have a job, kids and a wife to take care of, and other things that keep me off the computer most of my time at home (I'm writing this blog post after midnight).

No doubt, 6 hours per day is a lot of reading. But I average 2 hours per day, and I'm only putting down 75 books annually. 6 hours of reading per day would only put me around 225 books each year, a far cry from the 1,000 I should be hitting. What gives?

Well, it turns out, Charles Chu is being a bit ... liberal with his figures. First off, the average person does not read 400 words per minute. Try about half, at only 200 words per minute, according to Iris Reading, a company that sells a product on improving your reading speed and memory comprehension. This cuts our max books from 1,000 in a year to 500.

Second, Chu claims the average non-fiction book is 50,000 words in length. I can tell you that 50,000 words is a very slim novel. This feels like a Louis L'Amour western length to me. Most books that I have read are probably closer to twice that length. However, according to HuffPost which quotes Amazon Text Stats, the average book is 64,000 words in length. But, according to this blog post by Writers Workshop, the average "other fiction" novel length is 70,000 to 120,000 words. This feels much more in line with what I've read personally, so I'll go with about 100,000 words in a typical non-fiction novel.

So now that brings our annual total down from 500 to 250 books. That's reading 200 words per minute, for 6 hours every day, with non-fiction books that average 100,000 words in length. I claimed that I would probably come in around 225 books, so this seems to be a much closer ballpark figure.

But, does it line up? Let's look at it another way, and see if we can agree that 200-250 books annually, reading 6 hours per day, is more realistic.

I claimed I'm reading about 2 hours per day. I read about 3 hours for 4 of the 7 days in a week while commuting to work. For the other three days in the week, I can read anywhere from 1 hour to 3 hours, depending on circumstances. So my week can see anywhere from 13 hours to 15 hours on average. That's about 2 hours per day.

During 2016, I read 24,048 pages. That's about 65 pages per day, which feels right on target. But, how many words are there per page? According to this Google Answers answer, which offers a couple citations, a novel averages about 250 words per page.

But, readinglength.com shows that many books I've read are over 300 words per page, and some denser at 350 words per page, with the average sitting around 310. So 250 words per page at 65 pages per day is 16,250 words per day, and 310 words per page at 65 pages per day is 20,150 words per day.

Because I'm only reading about 2 hours per day, that means I'm reading at a meager 135 to 168 words per minute, based on the above stats. I guess I'm a slow reader.

If I highball it at 168 words per minute, then in 6 hours, I will have read 60,480 words. After a year of reading, that's 22,075,200 words. An independent blog post confirms this finding of 250-300 words per page, but also uses that to say that most adult books are 90,000 - 100,000 words in length (additional confirmation from earlier), and young adult novels target the 55,000 word length that Chu cited (maybe Chu likes reading young adult non-fiction?). As such, I can expect to read 22,075,200 words per year ÷ 100,000 words per book, or about 220 books in a year of reading 6 hours every day.

Bingo.

So, what can we realistically expect from reading?

Readers average 200 words per minute.

A page averages 250 words.

A novel averages 100,000 words.

One hour of reading per day can hit 30-40 books per year.

Six hours of reading per day can hit 200-250 books per year.

To read 1,000 books in a year, you need to read 22 hours per day.

This is reading average length adult non-fiction books at an average speed of 200 words per minute. The calculus completely changes if your average reading speed is faster than 200 wpm, you read primarily graphic novels with little text, or read shorter non-fiction novels. Fourth-grade chapter books? Yeah, I could read 1,000 of those in a year.
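The arithmetic in this post can be replayed in a few lines of Python. This is only a sketch using the figures quoted above (65 pages per day, 250 to 310 words per page, 100,000 words per book); the function names are my own for illustration.

```python
MINUTES_PER_HOUR = 60
DAYS_PER_YEAR = 365

def words_per_minute(pages_per_day, words_per_page, hours_per_day):
    """Reading speed implied by a daily page count."""
    return pages_per_day * words_per_page / (hours_per_day * MINUTES_PER_HOUR)

def books_per_year(wpm, hours_per_day, words_per_book):
    """Books finished in a year at a steady daily reading habit."""
    words_per_year = wpm * MINUTES_PER_HOUR * hours_per_day * DAYS_PER_YEAR
    return words_per_year / words_per_book

def hours_per_day_needed(books, wpm, words_per_book):
    """Daily reading time required to finish `books` in one year."""
    return books * words_per_book / (wpm * MINUTES_PER_HOUR * DAYS_PER_YEAR)

slow = words_per_minute(65, 250, 2)   # roughly 135 wpm
fast = words_per_minute(65, 310, 2)   # roughly 168 wpm
print(round(slow), round(fast))

# 168 wpm for 6 hours a day with 100,000-word books: a little over 220 books.
print(books_per_year(168, 6, 100_000))

# 1,000 such books in a year at 200 wpm: close to 23 hours of reading per day.
print(round(hours_per_day_needed(1000, 200, 100_000), 1))
```

Plugging in a faster reading speed or a shorter average book changes the result quickly, which is the whole point of the caveat above.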

Password Best Practices I - The Generator

Aaron Toponce — Mon, 18 Sep 2017 13:00:46 +0000

This is the first in a series of posts about password best practices. The series will cover best practices from a few different angles- the generator targeted at developers creating those generators, the end user (you, mom, dad, etc.) as you select passwords for accounts from those generators, and the service provider storing passwords in the database for accounts that your users are signing up for.

Motivation

When end users are looking for passwords, they may turn to password generators, whether they be browser extensions, websites, or offline installable executables. Regardless, as a developer, you will need to ensure that the passwords you are providing for your users are secure. Unfortunately, that's a bit of a buzzword, and can be highly subjective. So, we'll motivate what it means to be "secure" here:

The generator is downloaded via HTTPS, whether it's a website, executable ZIP, or browser extension.

The generator uses a cryptographically secure random number generator.

The generator provides at least 70-bits of entropy behind the password.

The generator is open source.

The generator generates passwords client-side, not server-side.

The generator does not serve any ads or client-side tracking software.

I think most of us can agree on these points- the software should be downloaded over HTTPS to mitigate man-in-the-middle attacks. A cryptographically secure RNG should be used to ensure unpredictability in the generated password. In addition to that, the CRNG should also be uniformly distributed across the set, so no elements of the password are more likely to appear than any other. Creating an open source password generator ensures that the software can be audited for correctness and instills trust in the application. Generating passwords client-side means the server has no possible way of knowing what passwords were generated, unless the client is also calling home (the code should be inspected). And of course, we don't want any adware or malware installed in the password generating application to further compromise the security of the generator.

Okay. That's all well and good, but what about this claim to generate passwords from at least 70-bits in entropy? Let's dig into that.

Brute force password cracking

Password cracking is all about reducing possibilities. Professional password crackers will have access to extensive word lists of previously compromised password databases, they'll have access to a great amount of hardware to rip through the password space, and they'll employ clever tricks in the password cracking software, such as Hashcat or MDXFind, to further reduce the search space, to make finding the passwords more likely. In practice, 90% of leaked hashed password databases are reversed trivially. With the remaining 10%, half of that space takes some time to find, but those passwords are usually recovered. The remaining few, maybe 3%-5%, contain enough entropy that the password cracking team likely won't recover those passwords in a week, or a month, or even a year.

So the question is this- what is that minimum entropy value that thwarts password crackers? To answer this question, let's look at some real-life brute force searching to see if we can get a good handle on the absolute minimum security margin necessary to keep your client's leaked password hash out of reach.

Bitcoin mining

Bitcoin mining is the modern-day version of the 1849 California Gold Rush. As of right now, Bitcoin is trading at $3,665.17 per BTC. As such, people are fighting over each other to get in on the action, purchasing specialized mining hardware, called "Bitcoin ASICs", to find those Bitcoins as quickly as possible. These ASICs are hashing blocks of data with SHA-256, and checking a specific difficulty criteria to see if it meets the requirements as a valid Bitcoin block. If so, the miner that found that block is rewarded that Bitcoin and it's recorded in the never-ending, ever-expanding, non-scalable blockchain.

How many SHA-256 hashes is the world at large calculating? As of this writing, the current rate is 7,751,843.02 TH/s, which is 7,751,843,020,000,000,000 SHA-256 hashes per second. At one point, it peaked at 8,715,000 THps, and there is no doubt in my mind that it will pass 10,000,000 THps before the end of the year. So let's run with that value, of 10,000,000,000,000,000,000 SHA-256 hashes per second, or 10¹⁹ SHA-256 hashes per second.

If we're going to talk about that in terms of bits, we need to convert it to a base-2 number, rather than base-10. Thankfully, this is easy enough. All we need to calculate is the log₂(X) = log(X)/log(2). Doing some math, we see that Bitcoin mining is roughly flipping every combination of bits in a:

63-bit number every second.

69-bit number every minute.

74-bit number every hour.

79-bit number every day.

84-bit number every month.

88-bit number every year.
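The conversion from a hash rate to "bits exhausted" is just a base-2 logarithm of the total number of hashes computed in the period. Here is a minimal Python sketch of that calculation. Rounding down to whole bits and treating a month as 30 days are my assumptions; they happen to reproduce the list above.

```python
import math

# Period lengths in seconds. The 30-day month is an assumption.
PERIODS = {
    "second": 1,
    "minute": 60,
    "hour": 60 * 60,
    "day": 24 * 60 * 60,
    "month": 30 * 24 * 60 * 60,
    "year": 365 * 24 * 60 * 60,
}

def bits_exhausted(hashes_per_second, seconds):
    """Largest whole-bit search space fully covered in the given time."""
    return math.floor(math.log2(hashes_per_second * seconds))

BITCOIN_RATE = 10**19  # hashes per second, the rounded figure used in the text

table = {name: bits_exhausted(BITCOIN_RATE, secs) for name, secs in PERIODS.items()}
for name, bits in table.items():
    print(f"{bits}-bit number every {name}")
```

The same function works for any guessing rate, so it can be reused for slower adversaries by swapping out the rate.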

What does this look like? Well, the line is nearly flat. Here in this image, the x-axis is the number of days spent mining for Bitcoin, starting from 0 through a full year of 365 days. The y-axis is the search space exhaustion in bits. So, you can see that in roughly 45 days, Bitcoin mining has calculated enough SHA-256 hashes to completely exhaust an 85-bit search space.

Real-world password cracking

That's all fine and dandy, but I doubt professional password crackers have access to that sort of hardware. Instead, let's look at a more realistic example.

Recently, Australian security researcher Troy Hunt, the guy that runs https://haveibeenpwned.com/, released a ZIP of 320 million SHA-1 hashed passwords that he's collected over the years. Because the passwords were hashed with SHA-1, recovering them should be like shooting fish in a barrel. Sure enough, a team of password crackers got together, and made mincemeat of the dataset.

In the article, it is mentioned that they had a peak password cracking speed of 180 GHps, or 180,000,000,000 SHA-1 hashes per second, or 18*10¹⁰ SHA-1 hashes per second. The article mentions that's the equivalent of 25 NVidia GTX1080 GPUs working in concert. To compare this to Bitcoin mining, the team was flipping every combination of bits in a:

37-bit number every second.

43-bit number every minute.

49-bit number every hour.

53-bit number every day.

58-bit number every month.

62-bit number every year.

As we can see, this is a far cry from the strength of Bitcoin mining. But, are those numbers larger than you expected? Let's see how it looks on the graph, compared to Bitcoin.

So, it seems clear that our security margin is somewhere above that line. Let's look at one more example, a theoretical one.

Theoretical password cracking by Edward Snowden

Before Edward Snowden became known to the world as Edward Snowden, he was known to Laura Poitras as "Citizenfour". In emails back-and-forth between Laura and himself, he told her (emphasis mine):

"Please confirm that no one has ever had a copy of your private key and that it uses a strong passphrase. Assume your adversary is capable of one trillion guesses per second. If the device you store the private key and enter your passphrase on has been hacked, it is trivial to decrypt our communications."

But one trillion guesses per second is only about 5x the collective power of our previous example of a small team of password cracking hobbyists. That's only about 125 NVidia GTX1080 GPUs. Certainly interested adversaries would have more money on hand to invest in more computing power than that. So, let's increase the rate to 10 trillion guesses per second. 1,250 NVidia GTX1080 GPUs would cost our adversary maybe $500,000. A serious investment, but possibly justifiable, and certainly not outside the $10 billion annual budget of the NSA. So let's roll with it.

At 10¹³ password hashes per second, we are flipping every combination of bits in a:

43-bit number every second.

49-bit number every minute.

54-bit number every hour.

59-bit number every day.

64-bit number every month.

68-bit number every year.

Plotting this on our chart with both Bitcoin mining and clustered hobbyist password cracking, we see the following.

The takeaway

What does all this math imply? That as a developer of password generator software, you should be targeting a minimum of 70-bits of entropy with your password generator. This will give your users the necessary security margins to steer clear of well-funded adversaries, should some service provider's password database get leaked to the Internet, and they find themselves as a target.

As a general rule of thumb, for password generator developers, these are the sort of security margins you can expect with entropy:

70-bits or more: Very secure.

65-69 bits: Moderately secure.

60-64 bits: Weakly secure.

59 bits or less: Not secure.

What does this mean for your generator then? This means that the size of the password or passphrase that you are giving users should be at least:

Base-94: 70/log₂(94)=11 characters

Base-64: 70/log₂(64)=12 characters

Base-32: 70/log₂(32)=14 characters

Base-16: 70/log₂(16)=18 characters

Base-10: 70/log₂(10)=22 characters

Diceware: 70/log₂(7776)=6 words
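Each of those minimum lengths is the target entropy divided by the bits per symbol, rounded up to a whole symbol. A short Python sketch, with names of my own choosing:

```python
import math

def min_length(target_bits, alphabet_size):
    """Fewest symbols needed to reach at least `target_bits` of entropy."""
    return math.ceil(target_bits / math.log2(alphabet_size))

# Alphabet sizes from the list above; Diceware has 7,776 words.
ALPHABETS = {
    "Base-94": 94,
    "Base-64": 64,
    "Base-32": 32,
    "Base-16": 16,
    "Base-10": 10,
    "Diceware": 7776,
}

lengths_70 = {name: min_length(70, size) for name, size in ALPHABETS.items()}
for name, length in lengths_70.items():
    print(f"{name}: {length}")
```

Rounding up matters: 10 Base-94 characters only reach about 65.5 bits, so the eleventh character is what carries the password past the 70-bit mark.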

Now, there is certainly nothing wrong with generating 80-bit, 90-bit, or even 128-bit entropy. The only thing you should consider with this, is the size of the resulting password and passphrases. For example, if you were providing a minimum of 128-bit security for your users with the password generator, then things would look like:

Base-94: 128/log₂(94)=20 characters

Base-64: 128/log₂(64)=22 characters

Base-32: 128/log₂(32)=26 characters

Base-16: 128/log₂(16)=32 characters

Base-10: 128/log₂(10)=39 characters

Diceware: 128/log₂(7776)=10 words

As you can see, as you increase the security for your users, the size of the generated passwords and passphrases will also increase.

Conclusion

It's critical that we are doing right by our users when it comes to security. I know Randall Munroe of XKCD fame created the "correct horse battery staple" comic, advising everyone to create 4-word passphrases. This is fine, provided that those 4 words meet that minimum 70-bits of entropy. In order for that to happen though, the word list needs to be:

4 = 70/log₂(x) => 4 = 70/(log(x)/log(2)) => 4 = 70*log(2)/log(x) => 4*log(x) = 70*log(2) => log(x) = (70/4)*log(2) => x = 10^((70/4)*log(2)) => x ~= 185,364

You would need a word list of at least 185,364 words to provide at least 17.5-bits of entropy per word, which brings us to the required 70-bits of total entropy for 4 words. All too often, I see generators providing four words, but the word list is far too small, like around Diceware size, which is only around 51-bits of entropy. As we just concluded, that's not providing the necessary security for our users.
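The same result falls out more directly by noting that each word must carry 70/4 = 17.5 bits, so the list needs 2^17.5 entries. A quick Python check, with function names that are mine:

```python
import math

def min_wordlist_size(target_bits, word_count):
    """Smallest word list giving `target_bits` across `word_count` words."""
    return math.ceil(2 ** (target_bits / word_count))

def passphrase_bits(wordlist_size, word_count):
    """Entropy of a passphrase drawn uniformly from the list."""
    return word_count * math.log2(wordlist_size)

print(min_wordlist_size(70, 4))               # 185364
print(round(passphrase_bits(7776, 4), 1))     # 51.7
```

Going the other way, a Diceware-sized list needs six words rather than four to clear the 70-bit minimum.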

So, developers, when creating password and passphrase generators, make sure they are at least targeting the necessary 70-bits of entropy, in addition to the other qualifications that we outlined at the beginning of this post.

Colorful Passphrases

Aaron Toponce — Fri, 15 Sep 2017 13:00:03 +0000

Since the development of my passphrase and password generator, I started working toward improving the other online generators out there on the web. I created a Google Spreadsheet to work toward that goal, by doing reasonable audits to "rank" each generator, and see how they stacked up against the rest. Then, I started submitting patches in hopes of making things better.

One passphrase generator that was brought to my attention was Pass Plum. Pass Plum supplies an example word list to use for generating your passphrases, if you choose to install the software on your own server. Unfortunately, the list is only 140 words in size, so if you choose to use that for your word list, then you only get about 7.13-bits of entropy per word. Sticking to the default configuration of 4 words given to the user, that's a scant 28-bits of security on your passphrase, which is trivially reversed. I submitted a pull request to extend it to 4,096 words, providing exactly 12-bits of entropy per word, or exactly 48-bits of entropy for a 4-word passphrase- a significant improvement.

I noticed, however, that the default list was nothing but color names, and that got me thinking- what if not only the generator provided color names for passphrases, but also colored the word that color name? Basically, a sort of false visual synesthesia. What I want to know is this, is it easier to remember passphrases when you can associate each word with a visual color?

So, over the past several nights, and during weekends, I've been putting this together. So, here it is- colorful passphrases.

Head over to my site to check it out. If a color is too light (its luma value is very high), then the word is outlined with CSS. Every word is bold, to make the word even more visible on the default white background.

As I mentioned, the idea is simple: people struggle remembering random meaningless strings of characters for passwords, so passphrases are a way to make a random series of words easier to recall. After all, it should be easier to remember "gnu hush gut modem scamp giddy" than it is to remember "$5hKXuE[\NK". It's certainly easier to type on mobile devices, and embedded devices without keyboards, like smart TVs and video game consoles.

But, even then, there is nothing that is really tying "gnu hush gut modem scamp giddy" together, so you force yourself into some sort of mnemonic to recall it. Visually stimulated color passphrases have the benefit of not only using a mnemonic to recall the phrase, but an order of colors as well. For example, you might not recall "RedRobin Pumpkin Revolver DeepPuce Lucky Crail TealDeer", but you may remember its color order of roughly "red orange black purple gold brown teal". "A RedRobin is red. A pumpkin is orange. A revolver (gun) is black. DeepPuce is a purple. Lucky coins are gold. Crail, Scotland has brown dirt. TealDeer are teal."

However, it also comes with a set of problems. First, what happens if you actually have visual synesthesia? Will seeing these colors conflict with your mental image of what the color should be for that word? Second, many of the words are very obscure, such as "Crail" or "Tussock" or "Tuatara" (as all seen in the previous screenshot collage). Finally, what happens when you have a color passphrase where two similar colors are adjacent to each other? Something like "Veronica Affair Pipi DeepOak Atoll BarnRed RedOxide"? Both "BarnRed" and "RedOxide" are a deep reddish color. Will it be more difficult to recall which comes first?

As someone who is interested in password research, I wanted to see what sort of memory potential visually colorful passphrases could have. As far as I know, this has never been investigated before (at least I couldn't find any research done in this area, and I can't find any passphrase generators doing it). This post from Wired investigates alternatives to text entry for password support, such as using color wheels, but doesn't say anything about visual text. Here is a browser extension that colors password form fields on websites, with the SHA-1 hash of your password as you type it. You know if it's correct, by recognizing if the pattern is the same it always is when logging in.

Long story short, I think I'm wading into unknown territory here. If you find this useful, or even if you don't, I would be very interested in your feedback.

A Practical and Secure Password and Passphrase Generator

Aaron Toponce — Mon, 04 Sep 2017 17:59:59 +0000

The TL;DR

Go to https://ae7.st/g/ and check out my new comprehensive password and passphrase generator. Screenshots and longer explanation below.

Introduction

Sometime during the middle of last summer, I started thinking about password generators. The reason for this, was that I noticed a few things when I used different password generators, online or offline:

The generator created random meaningless strings.

The generator created XKCD-style passphrases.

The generator gave the user knobs and buttons galore to control

Uppercase characters

Lowercase characters

Digits

Nonalphanumeric characters

Pronounceable passwords

Removing ambiguous characters

Password Length

The Problem

Here is just one example of what I'm talking about:

This password generator has a lot of options for tweaking your final password.

Ever since Randall Munroe published https://xkcd.com/936/, people started creating "XKCD-style" passphrase generators. Here's a very simple one that creates a four-word passphrase. No knobs, bells, or whistles. Just a button to generate a new XKCD passphrase. Ironically, the author provides an XKCD passphrase generator for you to use, then tells you not to use it.

On the other hand, why not make the XKCD password generation as complex as possible? Here at https://xkpasswd.net/s/, not only do you have an XKCD password generator, but you have all the bells, whistles, knobs, buttons, and control to make it as ultimately complex as possible. Kudos to the generator for even making entropy estimates about the generated passwords!

Why not add all the complexity of password generation to XKCD passwords?

What bothers me about the "XKCD password" crowd, however, is that no one knows that Diceware was realized back in 1995, making passphrases commonplace. Arnold Reinhold created a list of 7,776 words, enough for every combination of a 6-sided die rolled 5 times. Arnold explains that the passphrase needs to be chosen from a true random number generator (thus the dice) and as a result each word in the list will have approximately 12.9-bits of entropy. Arnold recommends throwing the dice enough times to create a five-word Diceware passphrase. That would provide about 64-bits of entropy, a modestly secure result.

A Diceware passphrase could look like any of these (each example here happens to be six words long):

soot laid tiger rilly feud pd

31 al alibi chick retch bella

woven error rove pliny dewey quo

My Solution

While these password generators are all unique, and interesting, and maybe even secure, it boils down to the fact that my wife, never mind my mom or grandma, isn't going to use them. They're just too complex. But worse, they give the person using them a false sense of security, and in most cases, they're not secure at all. I've talked with my wife, family, and friends about what it requires to have a strong password, and I've asked them to give me examples. You can probably guess what I got.

Spouse's first name with number followed by special character. EG: "Alan3!"

Favorite sports team in CamelCase. EG: "UtahUtes"

Keyboard patterns. EG: "qwertyasdf"

The pain goes on and on. Usually, the length of each password is somewhere around 6-7 characters. However, when you start talking about some of these generators, and they see passwords like "(5C10#+b" or "V#4I5'4c", their response is usually "I'm never going to remember that!". Of course, this is a point of discussion about password managers, but I'll save that for another post.

So I wanted to create a password and passphrase generator that met everyone's needs:

Simplicity of use

Length and complexity

Provably secure

Desktop and mobile friendly

If you've been a subscriber to my blog, you'll know that I post a lot about Shannon entropy. Entropy is maximized when a uniform unbiased random function controls the output. Shannon entropy is just a fancy way of estimating the total number of possibilities something could be, and it's measured in bits. So, when I say a Diceware passphrase has approximately 64-bits of entropy, I'm saying that the passphrase that was generated is 1 in 2^64 or 18,446,744,073,709,551,616 possibilities. Again, this is only true if the random function is uniform and unbiased.

So, I built a password generator around entropy, and entropy only. The question became, what should the range be, and what's my threat model? I decided to build my threat model after offline brute force password cracking. A single computer with a few modest GPUs can work through every 8-character password built from all 94 graphical characters on the ASCII keyboard hashed with SHA-1 in about a week. That's 94^8 or 6,095,689,385,410,816 total possibilities. If chosen randomly, Shannon entropy places any password built from that set at about 52-bits. If the password chosen randomly from the same set of 94 graphical characters was 9 characters long, then the password would have about 59-bits of Shannon entropy. This would also take that same GPU password cracking machine 94 weeks to fully exhaust every possibility.
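The figures in this threat model can be checked with a few lines of Python. This is only a sketch: the one-week baseline for 8 characters is the estimate quoted above, not a benchmark I am reproducing here, and the function names are illustrative.

```python
import math

ALPHABET = 94           # graphical ASCII characters
WEEKS_FOR_8_CHARS = 1   # the estimate from the text for a modest GPU rig

def search_space(length, alphabet=ALPHABET):
    """Number of distinct passwords of the given length."""
    return alphabet ** length

def entropy_bits(length, alphabet=ALPHABET):
    """Shannon entropy of a uniformly random password, in bits."""
    return length * math.log2(alphabet)

def weeks_to_exhaust(length):
    """Scale the 8-character baseline by the growth in the search space."""
    return WEEKS_FOR_8_CHARS * search_space(length) // search_space(8)

print(f"{search_space(8):,}")        # 6,095,689,385,410,816
print(round(entropy_bits(8), 1))     # 52.4
print(round(entropy_bits(9), 1))     # 59.0
print(weeks_to_exhaust(9))           # 94
```

Every extra Base-94 character adds about 6.55 bits and multiplies the attacker's work by 94, which is why the cost climbs so quickly.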

This seemed like a good place to start the range. So, for simplicity's sake, I started the entropy range at 55-bits, then incremented by 5 bits until the maximum of 80-bits. As you can see from the screenshot of the entropy toolbar, 55-bits is red as we are in dangerous territory of an offline password cracker with maybe a cluster of GPUs finding the password. But things get exponentially expensive very quickly. Thus, 60-bits is orange, 65-bits is yellow, and 70-bits and above are green. Notice that the default selection is 70-bits.

The entropy toolbar of my password generator, with 70-bits as the default.

When creating the generator, I realized that some sites will have length restrictions on your password, such as not allowing more than 12 characters, or not allowing certain special characters, or forcing at least one uppercase character and one digit, and so forth. Some service providers, like Google, will allow you any length with any complexity. But further, people remember things differently. Some people don't need to recall the passwords, as they are using password managers on all their devices, with a synced database, and can just copy/paste. Others want to remember the password, and others yet want it easy to type.

So, it seemed to me that not only could I build a password generator, but also a passphrase generator. However, I wanted this to be portable, so rather than create a server-side application, I made a client-side one. This does mean that you download the wordlists as you need them to generate the passphrases, and the wordlists are anything but light. However, you only download them as you need them, rather than downloading all on page load.

To maximize Shannon entropy, I am using the cryptographically secure pseudorandom number generator from the Stanford Javascript Crypto Library. I'm using this, rather than the web crypto API, because I use some fairly obscure browsers that don't support it. It's only another 11KB download, which I think is acceptable. SJCL does use the web crypto API to seed its generator, if the browser supports it. If not, an entropy collector listener event is launched, gathering entropy from mouse movements. The end result is that Shannon entropy is maximized.
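The generator itself is JavaScript built on SJCL, and I am not reproducing that code here. The underlying idea, picking each word uniformly with a CSPRNG until the entropy target is met, can be sketched in Python with the standard library's secrets module. The word list below is a hypothetical placeholder of 7,776 made-up entries; a real generator would load an actual list.

```python
import math
import secrets

# Placeholder list of 7,776 distinct entries, standing in for a real word list.
WORDLIST = [f"word{i:04d}" for i in range(7776)]

def generate_passphrase(wordlist, target_bits=70):
    """Pick enough uniformly random words to reach `target_bits` of entropy."""
    bits_per_word = math.log2(len(wordlist))
    word_count = math.ceil(target_bits / bits_per_word)
    # secrets.choice draws from the operating system's CSPRNG.
    return [secrets.choice(wordlist) for _ in range(word_count)]

passphrase = generate_passphrase(WORDLIST)
print(" ".join(passphrase))
print(len(passphrase), "words")
```

The entropy estimate only holds if the list has no duplicate entries and every word is equally likely, which is why the selection goes through a CSPRNG rather than a general-purpose random function.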

Passphrases

There are 5 types of passphrases in my generator:

Alternate

Bitcoin

Diceware

EFF

Pseudowords

Diceware

For the Diceware generator, I support all the languages that you'll find on the main Diceware page, in addition to the Beale word list. As of this writing, that's Basque, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Esperanto, Finnish, French, German, Italian, Japanese (Romaji), Maori, Norwegian, Polish, Portuguese, Russian, Slovenian, Spanish, Swedish, and Turkish. There are 7,776 words in each word list, providing about 12.9248-bits of entropy per word.

EFF

For the EFF generator, I support the three word lists that the EFF has created- the short word list, the long word list, and the "distant" word list, where every word has an edit distance of at least three from the others in the list. The long list is similar to the Diceware list, in that it is 7,776 words providing about 12.9248-bits of entropy per word. However, the words in the EFF long list are longer on average, at around 7 characters per word, than those in the Diceware word list, at around 4.3 characters per word. So, for the same entropy estimate, you'll have a longer EFF passphrase than a Diceware passphrase. The short word list contains only 1,296 words, to be used with 4 dice, instead of 5, and the maximum character length of any word is 5 characters. The short word list provides about 10.3399-bits of entropy per word. Finally, the "distant" word list is short in number of words also at 1,296 words, but longer in character count, averaging 7 characters per word.

Bitcoin

For the Bitcoin generator, I am using the BIP-0039 word lists to create the passphrase. These lists are designed to be a mnemonic code or sentence for generating deterministic Bitcoin wallets. However, because they are a list of words, they can be used for building passphrases too. Each list is 2,048 words, providing exactly 11-bits of entropy per word. Like Diceware, I support all the languages of the BIP-0039 proposal, which as of this writing includes Simplified Chinese, Traditional Chinese, English, French, Italian, Japanese (Hiragana), Korean (Hangul), and Spanish.

Alternate

Elvish

In the Alternate generator, I have a few options that provide various strengths and weaknesses. The Elvish word list is for entertainment value only. The word list consists of 7,776 words, making it suitable for Diceware, and provides about 12.9248-bits of entropy per word. However, because the generator is strictly electronic, and I haven't assigned dice roll values to each word, I may bump this up to 8,192 words providing exactly 13-bits of entropy per word. The word list was built from the Eldamo lexicon.

Klingon

Another passphrase generator for entertainment value is the Klingon generator. This word list comes from the Klingon Pocket Dictionary, and my word list provides exactly 2,604 unique words from the 3,028 words in the Klingon language. Thus, each word provides about 11.3465-bits of entropy.

PGP

The PGP word list was created to make hexadecimal strings easier to speak and phonetically unambiguous. It comprises exactly 256 words, providing exactly 8-bits of entropy per word. This generator works well in noisy environments, such as server rooms, where passwords need to be spoken from one person to another to enter into a physical terminal.

Simpsons

The Simpsons passphrase generator consists of 5,000 words, providing about 12.2877-bits of entropy per word. The goal of this generator is not only educational, to show that any source of words can be used for a password generator, such as a television series of episodes, but also to be more memorable. Because this list contains the most commonly spoken 5,000 words from the Simpsons episodes, a good balance of verbs, nouns, adjectives, etc. are supplied. As such, the generated passphrases seem to be easier to read, and less noun-heavy than the Diceware or EFF word lists. These passphrases may just be the easiest to recall, aside from the Trump word list.

Trump

And now my personal favorite. The Trump generator was initially built for entertainment purposes, but ended up having the advantage of providing a good balanced passphrase of nouns, verbs, adjectives, etc. much like the Simpsons generator. As such, these passphrases may be easier to recall, because they are more likely to read as valid sentences than the Diceware or EFF generators. This list is pulled from Donald J. Trump's Twitter account. The list is always growing, currently at 5,343 words providing about 12.3834-bits of entropy per word.

Pseudowords

The pseudowords generator is a cross between unreadable/unpronounceable random strings and memorable passphrases. They are pronounceable, even if the words themselves are gibberish. They are generally shorter in practice than passphrases, and longer than pure random strings. The generators are here to show what you can do with random pronounceable strings.

Bubble Babble

Bubble Babble is a hexadecimal encoder, with builtin checksumming, initially created by Antti Huima, and implemented in the original proprietary SSH tool (not the one by the OpenSSH developers). Part of the specification is that every encoded string begins and ends with "x". However, rather than encode data from the RNG, it is randomly generating 5-character words in the syntax of "<consonant><vowel><consonant><vowel><consonant>". As such, each 5-character word, except for the end points, provides 21*5*21*5*21=231,525 unique combinations, or about 17.8208-bits of entropy. The end points are in the syntax of "x<vowel><consonant><vowel><consonant>" or "<consonant><vowel><consonant><vowel>x", which is about 21*5*21*5=11,025 unique combinations, or about 13.4285-bits of entropy.
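The per-word entropy for these pseudowords is the base-2 logarithm of the number of letter combinations. Here is a short Python check, assuming 21 consonants and 5 vowels as the counts in the text imply, with interior words alternating consonant and vowel and end words carrying a fixed "x".

```python
import math

CONSONANTS = 21  # count implied by the arithmetic in the text
VOWELS = 5

# Interior word: consonant, vowel, consonant, vowel, consonant.
interior_combinations = CONSONANTS * VOWELS * CONSONANTS * VOWELS * CONSONANTS

# End word: the fixed "x" plus two vowels and two consonants.
end_combinations = CONSONANTS * VOWELS * CONSONANTS * VOWELS

interior_bits = math.log2(interior_combinations)
end_bits = math.log2(end_combinations)

print(f"{interior_combinations:,} combinations, {interior_bits:.4f} bits")
print(f"{end_combinations:,} combinations, {end_bits:.4f} bits")
```

A five-word string with two end words and three interior words would therefore carry roughly 2 * 13.43 + 3 * 17.82, or about 80 bits.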

Secret Ninja

This generator comes from a static character-to-string assignment that produces pronounceable Asian-styled words. As such, there are only 26 assignments, providing about 4.7004-bits of entropy per string. There are three strings concatenated together per hyphenated word.

Cosby Bebop

I was watching this YouTube video with Bill Cosby and Stewie from Family Guy, and about half-way through the skit, Bill Cosby starts using made-up words as part of his routine. I've seen other skits by comedians where they use made-up words to characterize Bill Cosby, so I figured I would create a list of these words, and see how they fell out. There are 32 unique words, providing exactly 5-bits of entropy per word. Unlike the Bubble Babble and Secret Ninja generators, this generator uses both uppercase and lowercase Latin characters.

Korean K-pop

Following on from the Bill Cosby Bebop generator, I created a Korean "K-pop" generator that uses the 64 most common male and female Korean names, providing exactly 6-bits of entropy per name. I got the list of names from various sites listing common male and female Korean names.

Random

These are random strings provided as a last resort for sites or accounting software that have very restrictive password requirements. These passwords will be some of the shortest generated while meeting the same minimum entropy requirement. Because these passwords are not memorable, they should absolutely be stored in a password manager (you should be using one anyway).

Base-94: Uses all graphical U.S. ASCII characters (does not include horizontal space). Each character provides about 6.5546-bits of entropy. This password will contain ambiguous characters.

Base-64: Uses all digits, lowercase and uppercase Latin characters, and the "+" and "/". Each character provides exactly 6-bits of entropy. This password will contain ambiguous characters.

Base-32: Uses the characters defined in RFC 4648, which strives to use an unambiguous character set. Each character provides exactly 5-bits of entropy.

Base-16: Uses all digits and lowercase characters "a" through "f". Each character provides exactly 4-bits of entropy. This password will contain fully unambiguous characters.

Base-10: Uses strictly the digits "0" through "9". This is mostly useful for PINs or other applications where only digits are required. Each digit provides about 3.3219-bits of entropy. This password will contain fully unambiguous characters.

Emoji: There are 881 emoji glyphs provided by the Noto Emoji font, yielding about 9.7830-bits per glyph. One side effect is that even though there is a character count in the generator box, each glyph may be more than 1 byte, so some input forms may count that glyph as more than 1 character. Regardless, the minimum entropy is met, so the emoji password is still secure.
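The bits-per-character figures in this list are simply the base-2 logarithm of each alphabet size. A small sketch of how they translate into password length, assuming a 70-bit target purely for illustration (the function and variable names are hypothetical):

```python
import math

# Alphabet sizes taken from the list above.
ALPHABETS = {"Base-94": 94, "Base-64": 64, "Base-32": 32,
             "Base-16": 16, "Base-10": 10, "Emoji": 881}

def symbols_needed(alphabet_size: int, target_bits: float) -> int:
    """Smallest number of uniformly random symbols reaching target_bits."""
    return math.ceil(target_bits / math.log2(alphabet_size))

TARGET_BITS = 70  # assumed target, for illustration only

for name, size in ALPHABETS.items():
    print(f"{name}: {math.log2(size):.4f} bits/symbol, "
          f"{symbols_needed(size, TARGET_BITS)} symbols for {TARGET_BITS} bits")
```

The larger the alphabet, the fewer symbols are needed for the same entropy, which is why the emoji passwords come out shortest in glyph count.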

I want to say something a bit extra about the Emoji generator. With the rise of Unicode and the UTF-8 standard, and the near ubiquitous popularity of smartphones and mobile devices, having access to non-Latin character sets is becoming easier and easier. As such, password forms are more likely to support UTF-8 on input to allow Cyrillic, Coptic, Arabic, and East Asian ideographs. So, if Unicode is fast becoming the norm, why not take advantage of it while having a little fun?

I opted for the black-and-white font, as opposed to the color font, to stay consistent with the look and feel of the other generators. This generator uses the emoji character sets provided by Google's Noto Emoji fonts, as that makes it easy for me to support the font in CSS 3, allowing every browser that supports CSS 3 to take advantage of the font and render the glyphs in a standard fashion. The license is also open so that I can redistribute the font without paying royalties, and others can do the same.

Screenshots

The post wouldn't be complete without some screenshots. The generator is both desktop friendly, fitting comfortably in a 1280x800 screen resolution, as well as mobile friendly, working well on even some of the oldest mobile devices.

Desktop screenshot.

First mobile screenshot.

Second mobile screenshot.

Random Passphrases Work, Even If They're Built From Known Passwords

Aaron Toponce — Thu, 03 Aug 2017 23:40:23 +0000

Just this morning, security researcher Troy Hunt released a ZIP containing 306 million passwords that he's collected over the years from his ';--have i been pwned? service. As an extension, he created a service that accepts either a password or a SHA-1 hash and tells you whether your password has been pwned.

In 2009, the social network RockYou was breached, and the usernames and passwords of 32 million accounts were released to the public Internet. No doubt those 32 million passwords are included in Troy Hunt's password dump. What's interesting, and the point of this post, is that individually, each password from the RockYou breach will fail.

However, what would happen if you took 6 random RockYou passwords, and created a passphrase? Below is a screenshot demonstrating just that using the above 6 randomly chosen RockYou passwords. Individually, they all fail. Combined, they pass.

Now, to be fair, I'm choosing these passwords from my personalized password generator. The list is the top 7,776 passwords from the 32 million RockYou dump. As such, you could use this list as a Diceware replacement with 5 dice. Regardless, each password is chosen at random from that list, and enough passwords are chosen to reach a 70-bit entropy target, which happens to be 6 passwords. Mandatory screenshot below:

The point of this post is that you can actually build a secure password for sites and services using previously breached passwords for your word list source, in this case, RockYou. The only conditions are that you have a word list large enough to create a reasonable passphrase with few selections, and that the process picking the passwords for you is cryptographically random.
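A minimal sketch of those two conditions in Python, assuming a stand-in list of 7,776 entries and the standard library's `secrets` module as the cryptographically random source. The `passphrase` function and `wordlist` contents are hypothetical, not the generator's actual code:

```python
import math
import secrets

def passphrase(words, target_bits=70):
    """Pick words uniformly at random until target_bits of entropy is reached."""
    unique = sorted(set(words))          # duplicates would skew the estimate
    bits_per_word = math.log2(len(unique))
    count = math.ceil(target_bits / bits_per_word)
    # secrets.choice draws from the operating system's CSPRNG.
    return [secrets.choice(unique) for _ in range(count)]

# Hypothetical stand-in list of 7,776 entries; a real list would hold
# the actual words or passwords, one per entry.
wordlist = [f"word{i}" for i in range(7776)]
print(" ".join(passphrase(wordlist)))
```

With 7,776 entries each selection is worth about 12.925 bits, so six selections are needed to clear 70 bits, matching the count above.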

Electronic Slot Machines and Pseudorandom Number Generators

Aaron Toponce — Fri, 17 Feb 2017 19:22:53 +0000

TL;DR

An Austrian casino company used a predictable pseudorandom number generator, rather than a cryptographically secure one, and people are taking advantage of it, and cashing out big.

The Story

Wired published an article about an amazing operation for beating electronic slot machines: holding your phone to the slot machine screen for a time while playing, leaving the slot machine, then coming back a second time, and cashing in big.

Unlike most slots cheats, he didn't appear to tinker with any of the machines he targeted, all of which were older models manufactured by Aristocrat Leisure of Australia. Instead he'd simply play, pushing the buttons on a game like Star Drifter or Pelican Pete while furtively holding his iPhone close to the screen.

He'd walk away after a few minutes, then return a bit later to give the game a second chance. That's when he'd get lucky. The man would parlay a $20 to $60 investment into as much as $1,300 before cashing out and moving on to another machine, where he'd start the cycle anew.

These machines were made by Austrian company Novomatic, and when Novomatic engineers learned of the problem, after a deep investigation, the best thing they could come up with, was that the random number generator in the machine was predictable:

Novomatic's engineers could find no evidence that the machines in question had been tampered with, leading them to theorize that the cheaters had figured out how to predict the slots' behavior. "Through targeted and prolonged observation of the individual game sequences as well as possibly recording individual games, it might be possible to allegedly identify a kind of 'pattern' in the game results," the company admitted in a February 2011 notice to its customers.

The article, focused on a single incident in Missouri, mentions that the state vets the machines before they go into production:

Recognizing those patterns would require remarkable effort. Slot machine outcomes are controlled by programs called pseudorandom number generators that produce baffling results by design. Government regulators, such as the Missouri Gaming Commission, vet the integrity of each algorithm before casinos can deploy it.

On random number generators

I'll leave you to read the rest of the article. Suffice it to say, the pseudorandom number generator in the Novomatic machines became predictable to anyone who observed its output for a period of time. This poses some questions that should immediately start popping up in your head:

What is the vetting process by states to verify the quality of the pseudorandom number generators in slot machines?

Who is on that vetting commission? Is it made up of mathematicians and cryptographers? Or just a board of executives and politicians?

Why aren't casino manufacturers using cryptographically secure pseudorandom number generators?

For me, that third item is the most important. No doubt, as the Wired article states, older machines just cannot be fixed. They need to be taken out of production. So long as they occupy casinos, convenience stores, and gas stations, they'll be attacked, and the owner will lose money. So let's talk about random number generators for a second, and see what the gambling industry can do to address this problem.

You can categorize random number generators into four categories:

Nonsecure pseudorandom

Cryptographically secure pseudorandom

Chaotic true random

Quantum true random
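To see why the first category is a problem, here is a toy sketch in Python. It is not the algorithm used in any slot machine; it is a textbook linear congruential generator (with the widely published "Numerical Recipes" constants) whose output reveals its whole internal state, so a single observation is enough to predict everything that follows:

```python
import secrets

# Toy linear congruential generator (LCG): state' = (A*state + C) mod M.
# Constants are for illustration only.
A, C, M = 1664525, 1013904223, 2**32

class ToyLCG:
    def __init__(self, seed: int):
        self.state = seed % M

    def next(self) -> int:
        self.state = (A * self.state + C) % M
        return self.state  # the output IS the full internal state

machine = ToyLCG(secrets.randbits(32))   # secret seed the observer never sees
observed = machine.next()                # one output watched by the observer

# The observer clones the generator from that single output.
clone = ToyLCG(0)
clone.state = observed
predictions = [clone.next() for _ in range(5)]
actual = [machine.next() for _ in range(5)]
print(predictions == actual)  # True
```

Real nonsecure generators usually hide part of their state, so an attacker needs more observations, but the principle is the same: the state can be recovered from the output.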

What I would be willing to bet, is that most electronic machines out there are of the "nonsecure pseudorandom" type of random number generator, and Novomatic just happened to pick a very poor one. Again, there likely isn't anything they can do about existing machines in production now, but what can they do moving forward? They should start using cryptographically secure pseudorandom number generators (CSPRNGs).

In reality, this is trivial. There are plenty of CSPRNGs to choose from. CSPRNGs can be broken down further into three subcategories:

Designs based on cryptographic primitives.

Number theoretic designs.

Special-purpose designs.

Let's look at each of these in turn.

Designs based on cryptographic primitives.

These are generators that use things like block ciphers, stream ciphers, or hashing functions for the generator. There are some NIST and FIPS standardized designs:

NIST SP 800-90A rev. 1 (PDF): CTR_DRBG (a block cipher, such as AES in CTR mode), HMAC_DRBG (hash-based message authentication code), and Hash_DRBG (based on cryptographically secure hashing functions such as SHA-256).

ANSI X9.31 Appendix A.2.4: This is based on AES, and obsoletes ANSI X9.17 Appendix C, which is based on 3DES. It requires a high-precision clock to initially seed the generator. It was eventually obsoleted by ANSI X9.62-1998 Annex A.4.

ANSI X9.62-2005 Annex D: This standard defines an HMAC_DRBG, similar to NIST SP 800-90A, using an HMAC as the cryptographic primitive. It obsoletes ANSI X9.62-1998 Annex A.4, and also requires a high-precision clock to initially seed the generator.

It's important that these designs are backtracking resistant, meaning that if you know the current state of the RNG, you cannot construct all previous states of the generator. The above standards are backtracking resistant.

Number theoretic designs

There are really only two current designs, that are based on either the factoring problem or the discrete logarithm problem:

Blum-Blum-Shub: This is a generator based on the fact that it is difficult to compute the prime factors of very large composites (on the order of 200 or more digits in length). Due to the size of the prime factors, this is a very slow algorithm, and not practical generally.

Blum-Micali: This is a generator based on the discrete logarithm problem, when given two known integers "b" and "g", it is difficult to find "k" where "b^k = g". Like Blum-Blum-Shub, this generator is also very slow, and not practical generally.
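A toy sketch of Blum-Blum-Shub in Python shows where the slowness comes from: every output bit costs one modular squaring. The primes below are tiny and chosen only so the example runs instantly; they offer no security, and the function name is hypothetical:

```python
import math

def bbs_bits(p: int, q: int, seed: int, count: int):
    """Yield `count` bits from a toy Blum-Blum-Shub generator."""
    assert p % 4 == 3 and q % 4 == 3, "p and q must be congruent to 3 mod 4"
    m = p * q
    assert math.gcd(seed, m) == 1, "seed must be coprime to p*q"
    x = seed * seed % m
    for _ in range(count):
        x = x * x % m          # one modular squaring per output bit
        yield x & 1            # emit only the least significant bit

# Toy parameters: a real deployment would use primes hundreds of digits long.
bits = list(bbs_bits(p=11, q=23, seed=3, count=16))
print(bits)
```

With primes of realistic size, each of those squarings operates on numbers hundreds of digits long, which is why the generator is rarely practical.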

Special-purpose designs

Thankfully, there are a lot of special purpose designs designed by cryptographers that are either stream ciphers that can be trivially ported to a CSPRNG, or deliberately designed CSPRNGs:

Yarrow: Created by cryptographer Bruce Schneier (deprecated by Fortuna)

Fortuna: Also created by Bruce Schneier, and obsoletes Yarrow.

ISAAC: Designed to address the problems in RC4.

ChaCha20: Designed by cryptographer Daniel Bernstein, our crypto Lord and Savior.

HC-256: The 256-bit alternative to HC-128, which is part of the eSTREAM portfolio.

eSTREAM portfolio: (7 algorithms- 3 hardware, 4 software)

Random123 suite: Contains four highly parallelizable counter-based algorithms, only two of which are cryptographically secure.

The solution for slot machines

So now what? Slot machine manufacturers should be using cryptographically secure algorithms in their machines, full stop. To be cryptographically secure, the generator:

Must pass the next-bit test (you cannot predict the next bit with any better than 50% probability).

Must withstand a state compromise (you cannot reconstruct past states of the generator based on the current state).

If those two properties are met in the generator, then the output will be indistinguishable from true random noise, and the generator will be unbiased, not allowing an adversary, such as someone with a cellphone monitoring the slot machine, to get the upper hand on the slot machine, and prematurely cash out.

However, the question should then be raised- "How do you properly seed the CSPRNG, so it starts in an unpredictable state, before release?" Easy, you have two options here:

Seed the CSPRNG with a hardware true RNG (HWRNG), such as a USB HWRNG, or....

Build the machine such that it collects environmental noise as entropy

The first point is much easier to achieve than the second. Slot machines likely don't have a lot of interrupts built into the system-on-a-chip (SoC). So aside from a microphone, video camera, or antenna recording external events, you're going to be hard-pressed to get any sort of high-quality entropy into the generator. USB TRNGs are available all over the web, and cheap. When the firmware is ready to be deployed, read 512-bits out of the USB generator, hash it with SHA-256, and save the resulting hash on disk as an "entropy file".

Then all that is left is when the slot machine boots up and shuts down:

On startup, read the "entropy file" saved from the previous shutdown, to seed the CSPRNG.

On shutdown, save 256-bits of data out of the generator to disk as an "entropy file".

This is how most operating systems have solved the problem with their built-in CSPRNGs. Provided that the very first "entropy file" was initially seeded with a USB true HWRNG, the state of every slot machine will always be different, and will always be unpredictable. Also, 256-bits is more than sufficient to make sure the initial state of the generator is unpredictable; physics proves it.
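The startup and shutdown steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `os.urandom` stands in for both the USB HWRNG and the machine's own CSPRNG, and the file location and function names are hypothetical:

```python
import hashlib
import os
import tempfile

SEED_BYTES = 32  # 256 bits

def first_seed(path, hwrng_read=os.urandom):
    """Factory step: hash 512 bits from the HWRNG and store the digest."""
    # Assumption: os.urandom stands in for a read from a USB HWRNG.
    raw = hwrng_read(64)
    with open(path, "wb") as f:
        f.write(hashlib.sha256(raw).digest())

def load_seed(path):
    """Startup: read the entropy file saved at the previous shutdown."""
    with open(path, "rb") as f:
        return f.read(SEED_BYTES)

def save_seed(path, csprng_read=os.urandom):
    """Shutdown: save 256 fresh bits from the running generator."""
    # Assumption: os.urandom stands in for the machine's own CSPRNG output.
    with open(path, "wb") as f:
        f.write(csprng_read(SEED_BYTES))

path = os.path.join(tempfile.mkdtemp(), "entropy-file")  # hypothetical location
first_seed(path)
boot_seed = load_seed(path)   # would be fed to the CSPRNG at startup
save_seed(path)               # overwritten at shutdown for the next boot
```

Because the file is rewritten at every shutdown, the same seed is never reused across boots.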

Of course, the SoC could have a HWRNG onboard, but then you run the risk of hardware failure, and the generator becoming predictable. This risk doesn't exist with software-based CSPRNGs, so provided you can always save the state of the generator on disk at shutdown, and read it on startup, you'll always have an unpredictable slot machine.

Adblockers Aren't Part Of The Problem- People Are

Aaron Toponce — Wed, 30 Nov 2016 15:06:44 +0000

Troy Hunt, a well-respected security researcher, and public speaker, wrote a blog post recently about how adblockers are part of the bad experience of the web. His article is about a sponsorship banner he posts at the top of his site, just below the header. It's not flashy, intrusive, loud, obnoxious, or a security or privacy concern. He gets paid better for the sponsorship strip than he does for ads, and the strip is themed with the rest of his site. It's out of the way of the site content, and scrolls with the page. In my opinion, it's in perfectly good taste. See for yourself:

Troy was surprised to find out, however, that his sponsorship strip is not showing when AdBlock Plus or uBlock Origin ad blockers are installed and enabled in the browser. He is understandably upset, as he is avoiding everything that pisses off the standard web user when it comes to ads. He reached out to ABP about whitelisting his strip, and they've agreed it's hardly violating web user experience. However, someone added it to the EasyList filters, which means any ad blocker outside of ABP will filter the sponsorship strip.

So, here's my question- are users wrong in filtering it?

Let's look at the state of web ads over the past couple decades. First, there was the ad popup, where the web page you were visiting would popup an ad right in front of the page. Sometimes they were difficult to close, and sometimes closing one would open a different one. Some pages would open dozens of popups, some fullscreen. It wasn't long before browsers across the board blocked popups by default, baked right into the browser.

After popups were unanimously blocked across every browser, advertisers turned to ad banners. These were just as obnoxious as the popups, even if you didn't have to close a window. They flashed, blinked, falsely promised free trips and gadgets, and even sometimes auto-played videos. They were rarely relevant to the site content, but web page owners were promised a revenue per click, regardless. So, the more you could fit on the page, the more likely someone would click on an ad, and you would get paid. Web page owners placed these obnoxious ads above the header, below the header, in the sidebars, in the middle of the pages breaking up paragraphs in posts, in the footers. In some cases, the screen real estate dedicated to ads was more than the actual content on the site.

Some HTML5 and CSS3 solutions now include overlays, that have to be manually closed or escaped, in order to continue parsing the site content. Unfortunately, ad blockers don't do a great job at blocking these. While they're great at finding and filtering out elements, blocking CSS overlay popups seems to be too difficult, as they are prevalent on the web, much to the chagrin of many ad block users.

Ad blockers then became a mainstay. Web users were pissed off due to Flash crashing the browser (most ads were Flash-based), slowing down their connection to download additional content (at the time, most were on dial-up or slow DSL), and in general just getting in the way. It got so bad, that DoubleClick's "privacy chief" wrote a rant about ad blockers, and how they were unethical, and ad blocker users were stealing revenue.

As web page analytics started becoming a thing, more and more website owners wanted to know how traffic was arriving at their site, so they could further increase that traffic, and in addition, increase ad revenue. Already, page counters like StatCounter existed, to help site owners understand partially how traffic was hitting them, where they came from, what time, what search engine they used, how long they stayed, etc. Well, advertisers started putting these analytics in their ads. So, not only did the website owner know who you were, the advertising company did too. And worse, while the website owner might not be selling that tracking data, the advertiser very likely is.

The advertiser also became a data broker.

But here's the tricky part- ad blocking was no longer enough. Now website owners were adding JavaScript trackers to their HTML. They're not visible on the page, so the ad blocker isn't hiding an element. It's not enough to block ads any longer. Privacy advocates began warning about "browser fingerprinting" based on the specific details in your browser that can uniquely identify you. Those unique bits are then tracked with these tracking scripts, and sent to advertisers and data brokers, changing many hands along the way. The EFF created a project to help users understand how unique they appeared on the web through the Panopticlick Project.

As a result, other browser extensions dedicated to blocking trackers started showing up. Things like Ghostery, Disconnect, Privacy Badger, and more. Even extensions that completely disable JavaScript and Flash became popular. Popular enough, that browsers implemented a "click-to-play" setting, where Flash and other plugin content was blocked by default, and you would need to click the element to display it. It's not uncommon now to visit a web page where your tracker blocker will block a dozen or more trackers.

I wish I could stop here, but alas, many advertisers have made a turn for the ugly. Now, web ads are the most common way to get malware installed on your computer. Known as "malvertising", it is more common at infecting your computer than shady porn sites. Even more worrisome, is that this trend is shifting away from standard desktops to mobile. Your phone is now more valuable than your desktop, and advertisers know it. Never mind shady apps that compromise your device, ads in legitimate "safe" apps are compromising devices as well.

So, to summarize, the history of ads has been:

Annoying popups.

Annoying banners.

Annoying CSS overlays.

Transparent trackers.

Malvertising.

So, to Troy Hunt, here's my question: Given the awful history of advertisements on the web, are you honestly surprised that users don't trust a sponsorship strip?

Consider the following analogy: Suppose I brought a bunch of monkeys to your home, and they trashed the place. Smashed dishes, tore up furniture, destroyed computers and televisions, ruined floors, broke windows, and generally just destroyed anything and everything in sight. Then, after cleaning the place up, not only do I bring the monkeys back, but this time, they have digital devices (cameras, microphones, etc.) that report back to me about what your house looks like, where you live, what you're doing in response to the destruction. Again, you kick them out, clean up the place, and I return with everything as before, with some of them carrying a contagious disease that can get you and your family sick. I mean, honestly, one visit of these monkeys is enough, but they've made three visits, each worse than before.

Now, you show up at my doorstep, with a well-trained, leashed, groomed, clean, tame monkey, and I'm supposed to trust that it isn't anything like the past monkeys I've experienced before? As tame as it may be, call me rude, but I'm not trusting of monkeys right now. I've installed all sorts of alarm and monitoring systems, to warn me when monkeys are nearby, and nuke them with lasers. I've had too many bad experiences with monkeys in the past, to trust anyone bringing a new monkey to the premises.

So, you can see, it's not ad blockers that are the problem. It's the people behind the advertising firms and it's the people not trusting the Internet. The advertising c-level executives are trying to find ways to get their ad in front of your eyes, and are using any sort of shady means necessary to do it. The average web user is trying to find ways to have a pleasant experience on the web, without getting tracked, infected with malware, shouted at by a video, while still being able to consume the content.

People arguably don't trust ads. The people in the advertising firms have ruined that trust. You may have a clean privacy-aware non-intrusive sponsorship strip, but you can't blame people for not trusting it. We've just had too long of a history of bad ad experiences. So, while reaching out to the ad blocker developers to whitelist the sponsorship strip is a good first step, ultimately, if people don't trust it, and want to block, you can't blame them. Instead, to make up for the revenue lost to ad blockers, continue focusing on what makes you successful: blogging, speaking, developing, engaging. Your content, who you are, how you handle yourself is your most valuable ad.

Breaking HMAC

Aaron Toponce — Fri, 29 Jul 2016 13:56:46 +0000

Okay. The title might be click bait, just a little, but after you finish reading this post, I think you'll be a bit more careful picking your HMAC keys. After learning this, I know I will be. However, HMAC is not broken. It just has an interesting ... property that's worth knowing about.

First off, let's remind ourselves what HMAC is. HMAC, or hash-based message authentication code, is a way to authenticate a cryptographic message using a shared secret key. Typically, two parties first run an asymmetric key agreement protocol, such as Diffie-Hellman, to securely share symmetric keys. These keys are used to encrypt messages as well as authenticate data. HMAC tags prevent chosen plaintext attacks, where the attacker can insert malicious data into the payload (send $1,000,000 to my account), and HMAC tags prevent adaptive chosen ciphertext attacks where the attacker can send encrypted data to the server, and learn what is being protected (is "password" in the payload?).

Authenticated messages are absolutely essential to modern cryptographic software. If you're writing cryptographic software, and you're not authenticating your ciphertext, you're doing it wrong. It doesn't matter if the data is at rest or in motion. Authenticate your ciphertext. This is where HMAC fits in.

So, best practice, when not using native AEAD ciphers, such as AES-GCM, or ChaCha20-Poly1305, is to encrypt the plaintext then authenticate it with HMAC, and prepend or append the digest (called a "MAC tag" or just "tag") to the ciphertext.

Something like this pseudocode:

ciphertext = AES-256-CTR(nonce, plaintext, key1)
tag = HMAC-SHA-512(ciphertext, key2)
ciphertext = tag || ciphertext

Then we ship off the newly bundled ciphertext and MAC tag. When it arrives at our destination, the recipient can verify if the ciphertext is what it should be, before decrypting it, by verifying the MAC tag, which is the whole point:

tag1 = ciphertext[:64]
data = ciphertext[64:]
# MAC only the data, not the prepended tag
tag2 = HMAC-SHA-512(data, key2)
hmac_check = 0
for char1, char2 in zip(tag1, tag2):
    hmac_check |= ord(char1) ^ ord(char2)
if hmac_check == 0:
    plaintext = AES-256-CTR(nonce, data, key1)
else:
    return False
return True

Notice that we're doing constant time comparison of the shipped tag and the calculated tag. Last thing we want is to introduce a timing attack by doing "if tag1 != tag2". However, after the constant time comparison, if tag1 and tag2 match, we can decrypt the data, and retrieve the plaintext.
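In Python, the standard library already provides a constant-time comparison in `hmac.compare_digest`, so the XOR loop need not be written by hand. A minimal sketch of just the tag-and-verify half, with random bytes standing in for real AES-256-CTR output (the `seal` and `open_sealed` names are hypothetical):

```python
import hashlib
import hmac
import os

TAG_LEN = hashlib.sha512().digest_size  # 64 bytes

def seal(ciphertext: bytes, mac_key: bytes) -> bytes:
    """Prepend an HMAC-SHA-512 tag to already-encrypted data."""
    tag = hmac.new(mac_key, ciphertext, hashlib.sha512).digest()
    return tag + ciphertext

def open_sealed(bundle: bytes, mac_key: bytes) -> bytes:
    """Return the ciphertext if the tag verifies, else raise ValueError."""
    tag, data = bundle[:TAG_LEN], bundle[TAG_LEN:]
    expected = hmac.new(mac_key, data, hashlib.sha512).digest()
    # compare_digest is the standard library's constant-time comparison.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("MAC tag mismatch")
    return data  # only now is it safe to hand the data to the decryptor

mac_key = os.urandom(32)
ciphertext = os.urandom(48)  # stand-in for real AES-256-CTR output
bundle = seal(ciphertext, mac_key)
print(open_sealed(bundle, mac_key) == ciphertext)  # True
```

Raising an exception on mismatch, rather than returning a flag, makes it harder for a caller to decrypt unauthenticated data by accident.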

So, now you have the background on HMAC, let's look at some examples with Python, and see where HMAC breaks. After all, that's the click bait title, no?

>>> import os
>>> import hmac
>>> import hashlib
>>> key = os.urandom(16)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.md5).hexdigest()
'd5a94f051b1e6ff67065b6f4c3a60130'

In this example, we're using HMAC-MD5. (Older versions of Python defaulted to MD5 when no digest was given; since Python 3.8 the digest must be named explicitly.) HMAC-MD5 is still considered cryptographically secure, even though vanilla MD5 is broken. Regardless, it'll suffice for this example, and we'll look at SHA-1 and SHA-2 also.

In RFC 2104, where HMAC is standardized, section 3 has this odd little tidbit (emphasis mine):

The key for HMAC can be of any length (keys longer than B bytes are
first hashed using H). However, less than L bytes is strongly
discouraged as it would decrease the security strength of the
function. Keys longer than L bytes are acceptable but the extra
length would not significantly increase the function strength. (A
longer key may be advisable if the randomness of the key is
considered weak.)

The "B bytes" length is the block size of the underlying HMAC operations. If the key is shorter than this block size, zeroes need to be appended to the key. If the key is longer than this block size, then it is hashed with the HMAC cryptographic hash.

The block size is 64 bytes for the following HMACs:

HMAC-MD5

HMAC-RIPEMD128

HMAC-RIPEMD160

HMAC-SHA1

In other words, HMAC wants the key to be exactly one block in length. If it's shorter, zeros are appended until it fills one block. If it's longer, it's hashed first, and zeros are then appended to the digest to fill exactly one block.

Can we test this?

>>> key = os.urandom(65) # longer than one block
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.md5).hexdigest()
'f887a4146e94ed47405c97931798885d'
>>> hmac.new(hashlib.md5(key).digest(), msg, hashlib.md5).hexdigest()
'f887a4146e94ed47405c97931798885d'

We have a collision. In other words:

For:
* 'H' a cryptographic hash
* 'k' a private key
* 'm' a message
* 'B' an HMAC block size

HMAC(k, m) == HMAC(H(k), m) for all 'k', where len(k) > B
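The reason is visible in the construction itself. Here is a minimal sketch of HMAC written out by hand, following the steps in RFC 2104, and checked against Python's own `hmac` module. SHA-256 is used here only as an example hash, and `manual_hmac` is a hypothetical name:

```python
import hashlib
import hmac
import os

def manual_hmac(key: bytes, msg: bytes, hash_name: str = "sha256") -> bytes:
    """HMAC written out step by step, following RFC 2104."""
    block_size = hashlib.new(hash_name).block_size
    if len(key) > block_size:
        # Long keys are replaced by their hash: this is the source of
        # HMAC(k, m) == HMAC(H(k), m).
        key = hashlib.new(hash_name, key).digest()
    key = key.ljust(block_size, b"\x00")        # pad with zeros to one block
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.new(hash_name, ipad + msg).digest()
    return hashlib.new(hash_name, opad + inner).digest()

key = os.urandom(65)   # one byte longer than SHA-256's 64-byte block
msg = os.urandom(256)
hashed_key = hashlib.sha256(key).digest()

print(manual_hmac(key, msg) == hmac.new(key, msg, hashlib.sha256).digest())  # True
print(manual_hmac(key, msg) == manual_hmac(hashed_key, msg))                 # True
```

Once the long key has been replaced by its digest, nothing downstream can tell whether the caller supplied the original key or its hash.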

Does this work with HMAC-SHA1?

>>> key = os.urandom(65)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.sha1).hexdigest()
'1070312944223b36928382d7a53ca54f7204ad4a'
>>> hmac.new(hashlib.sha1(key).digest(), msg, hashlib.sha1).hexdigest()
'1070312944223b36928382d7a53ca54f7204ad4a'

How about all the SHA-2 functions (SHA-224, SHA-256, SHA-384, and SHA-512)?

SHA-224:

>>> key = os.urandom(65)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.sha224).hexdigest()
'9ea8c3f667e55e6e9c5d63c5dd1b569ca69e2cc69f5e3fa3f87e94ba'
>>> hmac.new(hashlib.sha224(key).digest(), msg, hashlib.sha224).hexdigest()
'9ea8c3f667e55e6e9c5d63c5dd1b569ca69e2cc69f5e3fa3f87e94ba'

SHA-256:

>>> key = os.urandom(65)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.sha256).hexdigest()
'2aa02e678fcfe7ecaa1475efb70fe284fe91cc81e5a9c543433b70f5f5112c4b'
>>> hmac.new(hashlib.sha256(key).digest(), msg, hashlib.sha256).hexdigest()
'2aa02e678fcfe7ecaa1475efb70fe284fe91cc81e5a9c543433b70f5f5112c4b'

SHA-384:

>>> key = os.urandom(65)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.sha384).hexdigest()
'0941e6502233a72d01beeec729eaa7db2469f8ce96339cd5b3b2c9a4684501e6a7025fac6c9c20a511c48df76b453ec3'
>>> hmac.new(hashlib.sha384(key).digest(), msg, hashlib.sha384).hexdigest()
'5380905fb89fee68836be076ebfccff600e3b89c6554840fe61fed01b049d6a6a77423d2f5f4be1afb9d1c6f63b8b7fc'

SHA-512:

>>> key = os.urandom(65)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.sha512).hexdigest()
'36063bdc2d02ce8ea4b01b40ba040094c640959e0cc5716f7a75f119cbc348aa93d555f86bfcdaee5dad4ec2e5d53ed4362f9df0720ec0e1272288d49a912f7e'
>>> hmac.new(hashlib.sha512(key).digest(), msg, hashlib.sha512).hexdigest()
'8bfe675e3ca35a8680243d3747b5d3ce7ded1731e1a307cf5d1b00ae9243395ab94039f22585b417d7cbdf09f3d8dcf39c85ce147ff77c901c1a21f8de981b6a'

So it appears that the block size is indeed 64 bytes for MD5, SHA-1, SHA-224, and SHA-256, but for SHA-384 and SHA-512, it doesn't appear to be working. That is because the block size has changed to 128 bytes for these two functions. So, if our key is 129 bytes, we should be able to replicate collisions:

SHA-384 with a 129-byte key:

>>> key = os.urandom(129)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.sha384).hexdigest()
'bda2b586637e3bd73a27919601d7d5a1c1743f1f9f5cb72a0aa874f832046f4bc396ff8e307f9318dc404c4b432ca491'
>>> hmac.new(hashlib.sha384(key).digest(), msg, hashlib.sha384).hexdigest()
'bda2b586637e3bd73a27919601d7d5a1c1743f1f9f5cb72a0aa874f832046f4bc396ff8e307f9318dc404c4b432ca491'

SHA-512 with a 129-byte key:

>>> key = os.urandom(129)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.sha512).hexdigest()
'd0153f8bb6a549539abbcff8ee5ac7592c48c082bbbb7b3cc95dcb2166f162e5c59bb7bb3316e65d1481bd8697e8d3bc91deb46ad44845b972c57766f45c54bd'
>>> hmac.new(hashlib.sha512(key).digest(), msg, hashlib.sha512).hexdigest()
'd0153f8bb6a549539abbcff8ee5ac7592c48c082bbbb7b3cc95dcb2166f162e5c59bb7bb3316e65d1481bd8697e8d3bc91deb46ad44845b972c57766f45c54bd'

This isn't a poor Python implementation of HMAC. Try it in your favorite language, and you should be able to replicate the collisions. This is a "bug" in HMAC. If the key is longer than the block size, it's hashed with the HMAC cryptographic hash, then appended with zeros to fit the single block.

So what does this mean? It means that when choosing your HMAC keys, you should stay within one block size of bytes- 64 bytes or less for MD5, RIPEMD-128/160, SHA-1, SHA-224, SHA-256, and 128-bytes or less for SHA-384 and SHA-512. If you do this, you'll be fine.
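The block sizes need not be memorized: in Python, every `hashlib` hash object exposes a `block_size` attribute. A small sketch of a key-length check built on it (the `key_fits_block` helper is hypothetical):

```python
import hashlib

def key_fits_block(key: bytes, hash_name: str) -> bool:
    """True if the key will be used as-is (zero-padded) rather than hashed."""
    return len(key) <= hashlib.new(hash_name).block_size

block_sizes = {name: hashlib.new(name).block_size
               for name in ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")}
print(block_sizes)
print(key_fits_block(b"\x00" * 65, "sha256"))   # False: 65 > 64
print(key_fits_block(b"\x00" * 65, "sha512"))   # True: 65 <= 128
```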

Then again, you should probably be using NaCl or libsodium rather than piecing these cryptographic primitives manually yourself. These sorts of pitfalls are already handled for you.

This post is the result of a Twitter discussion between Scott Arciszewski, myself, and some others.

@adamcaudill @ErrataRob Aside from the fact that HMAC-MD5(m, k) and HMAC-MD5(m, MD5(k)) collide, when len(k) > 16

— Scott Arciszewski (@CiPHPerCoder) July 28, 2016

UPDATE: I'm not aware of any actual security implications where this is a problem that two distinct inputs produce the same HMAC digest. Ultimately, what are you trying to get out of HMAC? It's used as an authentication and integrity mechanism. So what if a long key, and the hash of that key produce the same HMAC digest? At 64 bytes, or 512-bits, the amount of work required at guessing the right key is anything but practical. Still, this is interesting.

Further Investigation Into Scrypt and Argon2 Password Hashing

Aaron Toponce — Thu, 30 Jun 2016 04:59:11 +0000

Introduction

In my previous post, I didn't pay close attention to the memory requirements of Argon2 when running my benchmarks. Instead, I just ran them until I got tired of waiting around. Further, I really didn't do justice to either scrypt or Argon2 when showing the parallelization factor. So, as a result, I did a lot more benchmarks on both, so you could more clearly see how the cost affects the time calculating the password hash, and how parallelization can affect that time.

More scrypt Benchmarks and Clarification

So, let's run some more scrypt benchmarks, and take a closer look at what's going on. Recall that the cost parameters for scrypt are:

N: The CPU and RAM cost

r: The mixing loop memory block size

p: The product factor

The recommended cost factors from Colin Percival are:

N: 16384 (2¹⁴)

r: 8

p: 1

To calculate the amount of memory used, we use the following equation:

Memory in bytes = (N * r * 128) + (r * p * 128)

So, according to the recommendation:

(2¹⁴ * 8 * 128) + (8 * 1 * 128) = 16,778,240 bytes
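The same arithmetic as a small sketch, so you can plug in your own cost factors. The function name scrypt_memory_bytes is just an illustrative label for the equation above.

```python
def scrypt_memory_bytes(n, r, p):
    """Memory estimate from the equation above: (N * r * 128) + (r * p * 128)."""
    return (n * r * 128) + (r * p * 128)


recommended = scrypt_memory_bytes(2**14, 8, 1)
print(recommended)           # 16778240
print(recommended / 2**20)   # just over 16 MiB
```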

Or, about 16 megabytes. According to Anthony Ferrara, you should be using at least 16 MiB with scrypt. At 4 MiB or less, it is demonstrably weaker than bcrypt. So, when you're looking at the images below, you'll notice that the first memory result is in red text, to show that 8 MiB is the weak point with those cost factors, and the 16 MiB is green, showing cost factors from there up are preferred. As a result, anything between 8 MiB and 16 MiB is default black, in that you should probably be cautious using these cost factors. While you might be demonstrably stronger than bcrypt with these work loads, you're not at the developer's recommendation of 16 MiB.

So, knowing that, let's look at the results. Notice that when we increase our product factor, our execution time increases by that factor. Despite urban legend, this isn't a parallelization constant (well, it is, but it's not breaking up the problem into smaller ones- it's multiplying it). The idea is that once you've reached a reasonable memory cost, you can increase the execution time by creating more mixing loops with the same cost on each loop. So, instead of one mixing loop costing you 16 MiB, you can have two mixing loops costing you 16 MiB. We didn't divide the problem, we multiplied it. As such, our execution time will double from one mixing loop to two mixing loops.

This seems strange, and indeed it is, but you should start with "p=1" at the memory cost you can afford, then increase the product factor if you can donate more time to the execution. In other words, the product factor is designed for hardware limited scenarios. In general, you'll want to look at your execution time, and let the memory come in as an afterthought (provided it's at or more than 16 MiB).
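If you want to watch the product factor at work on your own hardware, here is a rough sketch using hashlib.scrypt from the Python standard library (it needs a Python built against an OpenSSL that provides scrypt). The password, salt, and the 64 MiB maxmem ceiling are placeholder assumptions of mine, not values from my benchmarks, and your timings will differ.

```python
import hashlib
import time


def time_scrypt(password, salt, n, r, p):
    """Hash once with scrypt and return (digest, elapsed seconds)."""
    start = time.perf_counter()
    digest = hashlib.scrypt(password, salt=salt, n=n, r=r, p=p,
                            maxmem=64 * 1024 * 1024, dklen=64)
    return digest, time.perf_counter() - start


if __name__ == "__main__":
    # Same N and r on every run; only the product factor changes.
    for p in (1, 2, 4):
        _, seconds = time_scrypt(b"placeholder password", b"placeholder salt",
                                 2**14, 8, p)
        print("p={0}: {1:.3f}s".format(p, seconds))
```

On a single machine you would expect the time to grow roughly in step with "p", since each additional mixing loop repeats the same work.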

As in the last post, I have highlighted the green cells with interactive password sessions targeting half-of-a-second and red cells with symmetric key derivation targeting a full five seconds.

More Argon2 Benchmarks

When showing Argon2 in my last post, I did a poor job demonstrating several additional cost factors, and I didn't make it clear how the parallelization played a role when keeping the same cost factor. As a result, I ran additional benchmarks to make it more clear exactly how CPU and RAM play with each other in your work load.

As a reminder, the cost parameters for Argon2 are as follows:

n: The number of iterations on the CPU.

m: The memory work load.

p: The parallelization factor.

Unlike scrypt, where a single factor manipulates both the CPU and RAM cost, Argon2 separates them out. You deliberately have two knobs to play with- "n" for the CPU and "m" for the RAM. But, one affects the other. If you are targeting a specific execution time, and you increase your memory factor by two, then your CPU work factor will be decreased by half. On the reverse, if you increase your CPU work factor by two, then your memory work factor will be decreased by half. So, affecting one work factor affects the other.

Why is this important? Well, let's consider setting your memory requirement into the gigabyte range. At 1 GiB, for an interactive login session of .5 seconds, you would need at least both cores working on the hash, and you would only get a single iteration. In other words, your work is entirely memory dependent without any significant CPU cost. Maybe you're trying to thwart FPGAs or ASICs with the large memory requirement. However, is it possible that an adversary has 1 GiB of on-die cache? If so, because you're entirely memory-dependent, and no CPU work load, you've been able to cater to the adversary, without significant hardware cost.

On the reverse, you could get CPU heavy with 2048 iterations to hit your .5 seconds execution time, but then you would only be using 256 KiB of memory. You're likely not defeating the FPGAs and ASICs that Argon2 is designed for, as you're almost entirely processor-driven.

So, what to do? It would probably be a good idea to target a balance- require a significant amount of memory, even if it doesn't break the on-die cache barrier, while also requiring a significant amount of processor work. Sticking with Colin's recommendation of 16 MiB (2¹⁴) of memory and 32 iterations on 4 cores for interactive logins is probably a good balance. Then again, it will all depend on your hardware, what you can expect in customer execution time, load, and other variables.

However, here are additional timings of Argon2, just like with scrypt, so you can see how parallelization affects identical costs. Again, green cells are targeting .5 seconds for interactive logins, and red cells are targeting 5 seconds for symmetric key derivation.

Conclusion

Hopefully, this will help you make a more educated decision about your cost factors when deploying either scrypt or Argon2 as your password hash or symmetric key derivation function. Remember, that you have a few things to consider when picking your costs:

Make it expensive for both CPU and memory.

Target a realistic execution time for the situation.

Guarantee that you can always meet these goals after deployment.

Also, don't deploy Argon2 into production quite yet. Let it bake for a while. If it still stands secure in 2020, then you're probably good to go. Otherwise, deploy scrypt, or the other functions mentioned in the prior post.

Let's Talk Password Hashing

Aaron Toponce — Tue, 28 Jun 2016 06:43:48 +0000

TL;DR

In order of preference, hash passwords with:

Argon2

scrypt

bcrypt

PBKDF2

Do not store passwords with:

MD5

md5crypt

sha512crypt

sha256crypt

UNIX crypt(3)

SHA-1/2/3

Skein

BLAKE2

Any general purpose hashing function.

Any encryption algorithm.

Your own design.

Plaintext

Introduction

Something that comes up frequently in crypto circles, aside from the constant database leaks of accounts and passwords, is hashing passwords. Because of the phrase "hashing passwords", developers who may not know better will think of using generic one-way fixed-length collision-resistant cryptographic hashing functions, such as MD5, SHA-1, SHA-256, or SHA-512, without giving a second thought to the problem. Of course, using these functions is problematic, because they are fast. Turns out, we don't like fast hashing functions, because password crackers do like fast hashing functions. The faster they can go, the sooner they can recover the password.

The Problem

So, instead of using MD5, SHA-1, SHA-256, SHA-512, etc., the cryptographic community got together, and introduced specifically designed password hashing functions, where a custom work factor is included as a cost. Separately, key derivation functions were also designed for creating cryptographic keys, where a custom work factor was also included as a cost here. So, with password-based key derivation functions and specifically designed password hashing functions, we came up with some algorithms that you should be using instead.

The Solution

The most popular algorithms of this type would include, in my personal order of preference from most preferred to least preferred:

Argon2 (KDF)

scrypt (KDF)

bcrypt

PBKDF2 (KDF)

The only difference between a KDF and a password hashing function, is that the digest length can be arbitrary with KDFs, whereas password hashing functions will have a fixed length output.

UPDATE:
Argon2 has withstood the test of time, and should be considered best practice. If Argon2 cannot be deployed, scrypt offers memory hardness, which makes ASICs and FPGAs unaffordable. bcrypt and PBKDF2 do not offer memory hardness, but provide a tunable work factor for CPUs. Note that I used to advocate sha256crypt and sha512crypt. I no longer provide this advice. See this post as to why.

For the longest time, I was not a fan of scrypt as a password hashing function. I think I've changed my mind. Even though scrypt is sensitive to the parameters picked, and it suffers from a time-memory trade-off (TMTO), it's still considered secure, provided you pick sane defaults. I also place bcrypt over Argon2, because Argon2 was just recently announced as the Password Hashing Contest winner. As with all cryptographic primitives, we need time to analyze, attack, and pick apart the design. If after about 5 years, it still stands strong and secure, then it can be recommended as a solution for production. In the meantime, it's something certainly worth testing, but maybe not for production code. Finally, I prefer sha512crypt and sha256crypt over PBKDF2, mostly because they are included with every GNU/Linux distribution by default, they are based on the strong SHA-2 hashing function, which has had years and mountains of analysis, and unlike PBKDF2, you know exactly which hashing function is used. PBKDF2 could be using SHA-2 functions by default, or it could be using SHA-1. You'll need to check your library to be sure.

Different Strokes for Different Folks

Regardless, all of the above functions include cost parameters for manipulating how long it takes to calculate the hash from a password. It's less important exactly what the cost parameters are, and more important that you are targeting an appropriate time to work through the cost, and create the hash. This means you need to identify your threat model and your adversary.

The two common scenarios you'll find yourself in, are:

Password storage

Encryption keys

For password storage, your threat model is likely the password database getting leaked to the Internet, and password crackers around the world working on the hashes in the database to recover passwords. Thus, your adversary is malware, Anonymous, and password crackers. For encryption keys, your threat model is likely private encrypted keys getting compromised and side-channel attacks. Thus, your adversary is also malware, poor key exchanges, or untrusted networks. Knowing your threat model and your adversary changes how you approach the problem.

With password storage, you may be dealing with an interactive login, such as through a website. As such, you probably want the password hashing time to be quick, while still maintaining a work factor that would discourage large distributed attacks on your leaked database. Possibly, .5 seconds. This means if the database was leaked, the password cracker could do no more than 2 passwords per second. When you compare this to the millions of hashes per second a GPU could execute on Windows NTLM passwords, 2 passwords per second is extremely attractive. For encryption keys, you probably don't need to worry about interactive sessions, so taking 5 seconds to create the key from the password probably isn't a bad thing. So key crackers spending 5 seconds per guess trying to recover the password that created the encrypted private key is really nice.

bcrypt, sha256crypt, sha512crypt, & PBKDF2

So, knowing the work factors, what would it look like for the above algorithms? Below, I look at bcrypt, sha256crypt, sha512crypt, and PBKDF2 with their appropriate cost. I've highlighted the row green where a possible work factor could mean spending 0.5 seconds on hashing the password, and a red row where a possible work factor could mean spending 5 full seconds on creating a password-based encryption key.

Notice that for bcrypt, this means for password hashing, a factor of 13 would provide a cost of about 0.5s to hash the password, where a factor of 16 would get me close to my cost of about 5 seconds for creating a password-based key. For sha256crypt, sha512crypt, and PBKDF2, that seems to be about 640,000 and 5,120,000 iterations respectively.
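The iteration counts above are specific to my test machine. A rough way to find your own is to keep doubling the work factor until a single hash takes at least as long as your target. This is a sketch using hashlib.pbkdf2_hmac from the standard library; the function name calibrate_pbkdf2, the starting count, and the sample password are all assumptions for illustration.

```python
import hashlib
import os
import time


def calibrate_pbkdf2(target_seconds, start=10000, hash_name="sha256"):
    """Double the iteration count until one hash takes >= target_seconds.

    Returns (iterations, seconds measured for that count).
    """
    password = b"placeholder password"
    salt = os.urandom(16)
    iterations = start
    while True:
        began = time.perf_counter()
        hashlib.pbkdf2_hmac(hash_name, password, salt, iterations)
        elapsed = time.perf_counter() - began
        if elapsed >= target_seconds:
            return iterations, elapsed
        iterations *= 2


if __name__ == "__main__":
    # 0.5 seconds is the interactive login target used in this post.
    iterations, elapsed = calibrate_pbkdf2(0.5)
    print("{0} iterations took {1:.2f}s".format(iterations, elapsed))
```

A single measurement is noisy, so treat the result as a starting point and re-check it on the hardware that will actually run the logins.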

scrypt

When we move to scrypt, things get a touch more difficult. With bcrypt, sha256crypt, sha512crypt, and PBKDF2, our cost is entirely a CPU load factor. Unfortunately, while possibly problematic for fast GPU clusters, they still fall victim to algorithm-specific FPGAs and ASICs. In order to combat this, we need to also include a memory cost, seeing as though memory on these devices is expensive. However, having both a CPU and a RAM cost, means multiple knobs to tweak. So, Colin Percival, the designer of scrypt, decided to bundle both the CPU and the RAM cost into three factors: "N", "r", and "p". The resulting memory usage is calculated as follows:

Memory in bytes = (N * r * 128) + (r * p * 128)

There are a lot of suggestions out there about what's "best practice". It seems that you should at least have the following cost factors with scrypt, which provides a 16 MiB memory load:

N: 16384 (2¹⁴)

r: 8

p: 1

While you should be aware of the sensitivity of scrypt parameters, provided you are working with at least 16 MiB of RAM, you aren't any worse than other password hashing functions or KDFs. So, in the following tables, I increase the memory cost needed for the hash by tweaking the three parameters.

Update 2016-06-29: I've clarified these parameters in a follow-up post, which you should most definitely read at https://pthree.org/2016/06/29/further-investigation-into-scrypt-and-argon2-password-hashing/.

Because I only have access to a single-socket-quad core CPU in this testing machine, I wanted to limit my "p" cost to 1, 2, and 4, which is displayed in those tables. Further, I'm limited on RAM, and don't want to disrupt the rest of the applications and services running on the box, so I've limited my "r" cost to 4, 8, and 16 multiplied by 128 bytes (512 bytes, 1024 bytes, and 2048 bytes).

Interestingly enough, Colin Percival recommends 16 MiB (N=16384 (2¹⁴), r=8, p=1) for interactive logins and 16 MiB (N=131072 (2¹⁷), r=1, p=1) for symmetric key derivation. If I were targeting my 0.5s password hashing time, then I could improve that to 256 MiB (N=65536 (2¹⁶), r=8, p=1), or 2 GiB (N=2097152 (2²¹), r=8, p=1), if targeting just slightly more than 5 seconds for symmetric key derivation.

Argon2

Finally, we look at Argon2. Argon2 comes in two flavors- Argon2d and Argon2i; the first of which is data (d)ependent and the latter is data (i)ndependent. The former is supposed to be resistant against GPU cracking while the latter is supposed to be resistant against side-channel attacks. In other words, Argon2d would be suitable for password hashing, while Argon2i would be suitable for encryption key derivation. However, regardless of Argon2d or Argon2i, the cost parameters will perform the same, so we'll treat them as a single unit here.

Like scrypt, Argon2 has both a CPU and a RAM cost. However, both are handled separately. The CPU cost is handled through standard iterations, like with bcrypt or PBKDF2, and the RAM cost is handled through specifically ballooning the memory. When I started playing with it, I found that just manipulating the iterations felt very much like bcrypt, but I could affect the overall time it took to calculate the hash by just manipulating the memory also. When combining the two, I found that iterations affected the cost more than the RAM, but both had significant say in the calculation time, as you can see in the tables below. As with scrypt, it also has a parallelization cost, defining the number of threads you want working on the problem:

Note the RAM cost between 256 KiB and 16 MiB, in addition to the number of iterations and the processor count cost. As we balloon our RAM, we can bring our iteration cost down. As we require more threads to work on the hash, we can bring that iteration count down even further. Regardless, we are trying to target 0.5s for an interactive password login, and a full 5 seconds for password-based encryption key derivation.

Conclusion

So, what's the point? When hashing passwords, whether to store them on disk, or to create encryption keys, you should be using password-based cryptographic primitives that were specifically designed for this problem. You should not be using general purpose hashing functions of any type, because of their speed. Further, you should not be rolling out your own "key-stretching" algorithm, such as recursively hashing your password digest and additional output.

Just keep in mind- if the algorithm was specifically designed to handle passwords, and the cost is sufficient for your needs, threat model, and adversary, then you're doing just fine. Really, you can't go wrong with any of them. Just avoid any algorithm not specifically designed around passwords. The goal is security through obesity.

Best practice? In order of preference, use:

scrypt

bcrypt

Argon2

sha512crypt

sha256crypt

PBKDF2

Do not use:

MD5

md5crypt

UNIX crypt(3)

SHA-1/2/3

Skein

BLAKE2

Any general purpose hashing function.

Any encryption algorithm.

Your own design.

Plaintext

The Physics of Brute Force

Aaron Toponce — Sun, 19 Jun 2016 19:34:14 +0000

Introduction

Recently, MyDataAngel launched a Kickstarter project to sell a proprietary encryption algorithm and software with 512-bit and 768-bit symmetric keys. The motivation was that 128-bit and 256-bit symmetric keys just aren't strong enough, especially when AES and OpenSSL are older than your car (a common criticism they would mention in their vlogs). Back in 2009, Bruce Schneier blogged about Crypteto having a 49,152-bit symmetric key. As such, their crypto is 100% stronger, because their key is 100% bigger (than 4096-bit keys?). Meganet, which apparently still exists, has a 1 million-bit symmetric key!

It's hard to take these encryption products seriously, when there are no published papers on existing primitives, no security or cryptography experts on your team, and you're selling products with ridiculous key lengths (to be fair, 512-bit and 768-bit symmetric keys aren't really that ridiculous). Never mind that your proprietary encryption algorithm is neither peer-reviewed nor freely available to the public. Anyone can create a symmetric encryption algorithm that they themselves cannot break. The trick is releasing your algorithm for peer review, letting existing cryptography experts analyze the design, and still coming out on top with a strong algorithm (it wouldn't hurt if you analyzed existing algorithms and published papers yourself).

So with that, I want to talk a bit about the length of symmetric keys, and what it takes to brute force them. Bruce Schneier addressed this in his "Applied Cryptography" book through the laws of thermodynamics. Unfortunately, he got some of the constants wrong. Although the conclusion is basically the same, I'm going to give you the same argument, with updated constants, and we'll see if we come to the same conclusion.

Counting Bits

Suppose you want to see how many bits you can flip in one day by counting in binary every second. Of course, when you start counting, you would start with "0", and your first second would flip your first bit to "1". Your second second would flip your second bit to "1" while also flipping your first bit back to "0". Your third second would flip the first bit back to "1", and so forth. Here is a simple GIF (pronounced with a hard "G") counting from 0 to 127, flipping bits each second.

By the end of a 24-hour period, I would have hit 86,400 seconds, which is represented as a 17-bit number. In other words, every 24 hours, flipping 1 bit per second, I can flip every combination of bits in a 16-bit number.

By the end of a single year, we end up with a 25-bit number, which means flipping a single bit every second can flip every combination of 24-bits every year.
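Those bit lengths are easy to check with Python's built-in int.bit_length(). This sketch assumes a 365-day year.

```python
seconds_per_day = 24 * 60 * 60
seconds_per_year = 365 * seconds_per_day

# Number of bits needed to write each count in binary.
print(seconds_per_day, seconds_per_day.bit_length())     # 86400 17
print(seconds_per_year, seconds_per_year.bit_length())   # 31536000 25
```

A 17-bit total means every 16-bit value was passed on the way there, and likewise a 25-bit total covers every 24-bit value.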

So, the obvious question is then this- what is the largest combination of bits that I can flip through to exhaustion? More importantly, how many computers would I need to do this work (what is this going to cost)?

Some Basic Physics

One of the consequences of the second law of thermodynamics, is that it requires energy to do a certain amount of work. This could be anything from lifting a box over your head, to walking, to even getting out of bed in the morning. This also includes computers and hard drives. When the computer wishes to store data on disk, energy is needed to do that work. This is expressed with the equation:

Energy = kT

Where "k" is Boltzmann's constant of 1.38064852×10^-16 ergs per Kelvin, and "T" is the temperature of the system. I'm going to use ergs as our unit, as we are speaking about work, and an "erg" is a unit of energy. Of course, a "Kelvin" is a unit of temperature, where 0 Kelvin is defined as a system devoid of energy; also known as "absolute zero".

It would make the most sense to get our computer as absolutely cool as possible to maximize our output while also minimizing our energy requirements. Current background radiation in outer space is about 2.72548 Kelvin. To run a computer cooler than that would require a heat pump, which means adding more energy to the system than what is needed for our computation. So, we'll run this ideal computer at 2.72548 Kelvin.

As a result, this means that to flip a single bit with our ideal computer, it requires:

Energy = (1.38064852×10^-16 ergs per Kelvin) * (2.72548 Kelvin) = 3.762929928*10^-16 ergs

Some Energy Sources

The Sun

Now that we know our energy requirement, let's start looking at some energy sources. The total energy output from our star is about 1.2*10³⁴ Joules per year. Because one Joule is the same as 1*10⁷ ergs, then the total annual energy output of the Sun is about 1.2*10⁴¹ ergs. So, doing some basic math:

Bits flipped = (1.2*10⁴¹ ergs) / (3.762929928*10^-16 ergs per bit) = 3.189004374*10⁵⁶ bits

3.189004374*10⁵⁶ bits is approximately 2¹⁸⁷, which means I can flip every combination of bits in a 187-bit number, if I could harness 100% of the solar energy output from the sun each year. Unfortunately, our Sun is a weak star.

A Supernova

A supernova is calculated to release something around 10⁴⁴ Joules or 10⁵¹ ergs of energy. Doing that math:

Bits flipped = (10⁵¹ ergs) / (3.762929928*10^-16 ergs per bit) = 2.657503608*10⁶⁶ bits

2.657503608*10⁶⁶ bits is approximately 2²²⁰. Imagine flipping every bit in a 220-bit number in an orgy of computation.

A Hypernova

A hypernova is calculated to release something around 10⁴⁶ Joules or 10⁵³ ergs of energy. Doing that math:

Bits flipped = (10⁵³ ergs) / (3.762929928*10^-16 ergs per bit) = 2.657503608*10⁶⁸ bits

2.657503608*10⁶⁸ bits is approximately 2²²⁷. This is a computation orgy turned up to 11.

Of course, in all 3 cases, I would have to harness 100% of that energy into my ideal computer, to flip every combination of these bits. Never mind finding transportation to get me to that hypernova, the time taken in that travel (how many millions of light years away is it?), and the cost of the equipment to harness the released energy.
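If you want to rerun the three calculations above with different constants, here is a short sketch. The energy figures and the temperature are the same ones used in this post; the function name exhaustible_bits is only an illustrative label.

```python
import math

BOLTZMANN_ERGS_PER_KELVIN = 1.38064852e-16
TEMPERATURE_KELVIN = 2.72548  # background radiation of outer space
ERGS_PER_BIT_FLIP = BOLTZMANN_ERGS_PER_KELVIN * TEMPERATURE_KELVIN

# Energy figures from the sections above, in ergs.
ENERGY_SOURCES_ERGS = {
    "Sun, one year": 1.2e41,
    "supernova": 1e51,
    "hypernova": 1e53,
}


def exhaustible_bits(ergs):
    """Largest n such that 2^n bit flips fit within the given energy."""
    flips = ergs / ERGS_PER_BIT_FLIP
    return math.floor(math.log2(flips))


for name, ergs in ENERGY_SOURCES_ERGS.items():
    print(name, exhaustible_bits(ergs))
```

This prints 187, 220, and 227 for the Sun, the supernova, and the hypernova respectively, all well short of 256.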

Bitcoin Mining

As a comparative study, Bitcoin mining has almost surpassed 2 quintillion SHA-256 hashes per second. If you don't think this is significant, it is. That's processing all of a 60-bit number (all 2⁶⁰ values) every second, or an 85-bit number (all 2⁸⁵ values) every year. This is hard evidence, right now, of a large scale 256-bit brute force computing project, and it's barely flipping all the bits in an 85-bit number every year. The hash rate would have to double (4 quintillion SHA-256 hashes every second) to surpass flipping all the bits in an 86-bit number every year.
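The Bitcoin numbers can be checked the same way. This sketch assumes a flat 2 quintillion hashes per second and a 365-day year.

```python
import math

HASHES_PER_SECOND = 2e18              # 2 quintillion SHA-256 hashes per second
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def exhausted_bits(hashes):
    """Largest n such that 2^n values are covered by this many hashes."""
    return math.floor(math.log2(hashes))


per_second = exhausted_bits(HASHES_PER_SECOND)
per_year = exhausted_bits(HASHES_PER_SECOND * SECONDS_PER_YEAR)
doubled_per_year = exhausted_bits(2 * HASHES_PER_SECOND * SECONDS_PER_YEAR)
print(per_second, per_year, doubled_per_year)   # 60 85 86
```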

Further, we do not have any evidence of any clustered supercomputing project that comes close to that processing rate. It can be argued that the rate of Bitcoin mining is the upper limits of what any group of well-funded organizations could afford (I think it's fair to argue several well-funded organizations are likely Bitcoin mining). To produce a valid conspiracy theory to counteract that claim, you would need to show evidence of organizations that have their own semiconductor chip manufacturing, that has outpaced ARM, AMD, Intel and every other chip maker on the market, by several orders of magnitude.

Regardless, we showed the amount of energy needed anyway to flip every bit in a 256-bit number, and the laws of thermodynamics strongly imply that it's just not physically possible.

Asymmetric Cryptography

Things change when dealing with asymmetric cryptography. Now, instead of creating a secret 256-bit number, you're using mathematics, such as prime number factorization or elliptic curve equations. This changes things dramatically when dealing with key lengths, because even though we assume some mathematical problems are easy to calculate, but hard to reverse, we need to deal with exceptionally large numbers to give us the security margins necessary to prove that hardness.

As such, it becomes less of a concern about energy, and more a concern about time. Of course, key length is important up to a point. We just showed with the second law of thermodynamics, that brute forcing your way from 0 to 2²⁵⁶ is just physically impossible. However, finding the prime factors of that 256-bit number is a much easier task, does not require as much energy, and can be done by testing no more than half of the numbers up to its square root (in this case, 2¹²⁷, assuming we're only testing prime numbers).

As such, we need to deal with prime factors that are difficult to find. It turns out that it's not enough to just have a 512-bit private key to prevent the Bad Guys from finding your prime factors. This is largely because there are efficient algorithms for calculating and testing prime numbers. So, it must also be expensive to calculate and find those primes. Currently, best practice seems to be generating two 1024-bit prime factors to produce a 2048-bit private RSA key.

Fixed-length Collision-resistant Hashing

Fixed-length collision-resistant hashing puts a different twist on brute force searching. The largest problem comes from the Birthday Attack. This states that if you have approximately the square root of 2 × 365 × ln(2) people in the room (about 23 people), the chances that any two people share the same birthday is 50%. Notice that this comes from any two people in the room. You haven't singled out one person and asked whether any of the other 22 people share that person's birthday; the odds of that are far lower than 50%. This isn't a search for a match against one pre-selected value. This is a blind search. You ask the first person what their birthday is, and compare it with the other 22 people in the room. Then you ask the second person what their birthday is, and compare it with the remaining 21 people in the room. And so on and so forth. After working through all 23 people comparing everyone's birthday to everyone else's birthday, the odds you found a match between two random people is 50%.
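The 50% figure can be checked exactly by computing the chance that everyone in the room has a distinct birthday and subtracting it from one. This sketch assumes 365 equally likely birthdays; the function name is my own.

```python
def birthday_collision_probability(people, days=365):
    """Chance that at least two of `people` share a birthday."""
    all_distinct = 1.0
    for already_taken in range(people):
        # Each new person must avoid every birthday seen so far.
        all_distinct *= (days - already_taken) / days
    return 1.0 - all_distinct


for people in (22, 23):
    print(people, round(birthday_collision_probability(people), 4))
```

With 22 people the probability is still just under one half, and the 23rd person pushes it over.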

Why is this important? Suppose you are processing data with SHA-1 (160-bit output). You only need to calculate 2⁸⁰ SHA-1 hashes before your odds of finding a duplicate hash out of the currently calculated hashes reaches 50%. As we just learned with Bitcoin, this is practical within one year with a large orchestrated effort. Turns out, SHA-1 is weaker than that (we only need to calculate 2⁶⁴ hashes for a 50% probability), which is why the cryptographic community has been pushing so hard to get everyone and everything away from SHA-1.

Now you may understand why 384-bit and 512-bit (and more up to 1024-bit) cryptographically secure fixed-length collision-resistant hashing functions exist. Due to the Birthday Attack, we can make mince meat of our work.

Conclusion

As clearly demonstrated, the second law of thermodynamics provides a clear upper bound on what can be found with brute force searches. Of course, brute force searches are the least effective way to find the private keys you're looking for, and indeed, there are more efficient ways to get to the data. However, if you provide a proprietary encryption algorithm with a closed-source implementation, that uses ridiculously long private keys, then it seems clear that you don't understand the physics behind brute force. If you can't grasp the simple concept of these upper bounds, why would I want to trust you and your product in other areas of security and data confidentiality?

Quantum computing does give us some far more efficient algorithms that classical computing cannot achieve, but even then, 256-bits still remains outside of the practical realm of mythical quantum computing when brute force searching.

As I've stated many times before- trust the math.

Webcam Random Number Generation

Aaron Toponce — Sun, 12 Jun 2016 21:13:28 +0000

A couple weeks ago, I purchased a lava lamp for $5 at a thrift store. It was in brand spanking new condition, and worked like a charm. The only thing going through my head at the time? I can't wait to point my webcam at it, and start generating some random numbers! Okay, well that, and mood lighting for the wife.

I purchased a lava lamp over the weekend. Inefficient and slow random numbers, here we come! pic.twitter.com/umE0VdSP8l

— Aaron Toponce (@AaronToponce) May 31, 2016

Anyway, I wrote a quickie Python script which will capture a frame from the webcam, hash it with a keyed BLAKE2, and output the result to a FIFO file to be processed. The BLAKE2 digest of the frame also becomes the key for the next BLAKE2 instance, making this script very CBC-like in execution (the first function is keyed from /dev/urandom, and each digest keys the next iteration).

#!/usr/bin/env python3

# Create true random seeds (near as we can tell) with your webcam.
#
# This script will use your webcam pointed at a source of entropy, keyed with
# random data from the OS CSPRNG. You could point the camera at:
#
# * Lava lamps
# * Plasma globes
# * Double pendulums
# * Rayleigh-Benard convection
# * Brownian motion
#
# Performance is ~ 2 KiB/s.
# Requires pyblake2: https://pypi.python.org/pypi/pyblake2
#
# Released to the public domain.

import os
import sys
import cv2
import pyblake2

cap = cv2.VideoCapture(0)
webcamfile = '/tmp/webcamfile.fifo'
key = os.urandom(64)  # initial 64-byte BLAKE2b key from the OS CSPRNG

try:
    os.mkfifo(webcamfile)
except OSError as e:
    # Without the FIFO there is nowhere to write, so stop here.
    sys.exit("Cannot create FIFO: {0}".format(e))

# Open read/write and unbuffered so open() doesn't block waiting for a reader.
fifo = open(webcamfile, 'w+b', buffering=0)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Keyed BLAKE2b over the raw frame bytes; the digest keys the next round.
    b2sum = pyblake2.blake2b(key=key)
    b2sum.update(frame.tobytes())
    digest = b2sum.digest()
    key = digest

    fifo.write(digest)

    cv2.imshow('webcamlamp', frame)
    k = cv2.waitKey(1) & 0xFF
    if k == 27:  # Esc quits
        break

fifo.close()
os.remove(webcamfile)
cap.release()
cv2.destroyAllWindows()

As you'll notice in the code, you should point your webcam at a source of either chaotic randomness, like a lava lamp, or quantum randomness, like a plasma globe. Because the frame is whitened with a keyed BLAKE2, it could be considered as a true random number generator, or you could use it as a seed for a cryptographically secure pseudorandom number generator, such as those shipped with modern operating systems. If you do use this as a TRNG, realize that it's slow- it only operates at about 2 KiBps.
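If you don't have a webcam handy, the chaining itself can be tried on any sequence of byte strings. This is a sketch under a couple of assumptions: it uses hashlib.blake2b from the Python 3 standard library in place of pyblake2, and chained_digests is a name I made up for illustration.

```python
import hashlib
import os


def chained_digests(frames, key=None):
    """Yield one keyed BLAKE2b digest per frame; each digest keys the next."""
    if key is None:
        key = os.urandom(64)  # first key comes from the OS CSPRNG
    for frame in frames:
        digest = hashlib.blake2b(frame, key=key).digest()
        key = digest  # the 64-byte digest is a valid 64-byte key
        yield digest


# Two identical "frames" still give different output, because the key changed.
first, second = chained_digests([b"same frame", b"same frame"])
print(first != second)
```

This is why a camera that briefly returns identical frames does not repeat its output.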

Here is a screenshot of the webcam itself looking at a USB desk plasma globe that you can purchase from ThinkGeek for $10.

The data is sent to a FIFO in /tmp/. If you don't do anything with the data, and let the buffer fill, the script will hang, until you read data out of the FIFO. As such, you could do something like this to reseed your CSPRNG (of course, it's not increasing the entropy estimate, just reseeding the generator):

$ cat /tmp/webcamfile.fifo > /dev/random
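If you would rather pull a fixed amount of data out of the FIFO than stream it forever, something like the following could work. This is only a sketch: read_seed is a hypothetical helper, and it treats the FIFO like any other file opened for reading.

```python
def read_seed(path, nbytes=64):
    """Read exactly nbytes from path (e.g. the script's FIFO), or raise."""
    chunks = []
    remaining = nbytes
    with open(path, "rb") as src:
        while remaining > 0:
            chunk = src.read(remaining)
            if not chunk:  # the writer closed before we got enough data
                raise EOFError(
                    "source ended after %d bytes" % (nbytes - remaining)
                )
            chunks.append(chunk)
            remaining -= len(chunk)
    return b"".join(chunks)
```

Calling read_seed('/tmp/webcamfile.fifo', 64) while the capture script is running should return one digest's worth of bytes, which could then be written to /dev/random just as the shell redirect does.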

Lava lamps and plasma globes are only the beginning. Anything quantum or chaotic that can be visually observed also works. Things like:

Double pendulums

Brownian motion

Rayleigh-Benard convection

CCD noise from the webcam itself

A bouncing ball on a sinusoidal vibrating table

So, there you have it. Plasma globes and lava lamps providing sufficiently random data via a webcam, either to be used as a secret seed, or as a TRNG itself. If you know of other systems worth pointing a webcam at, or have suggestions for improving the Python code, let me know in the comments.

CPU Jitter Entropy for the Linux Kernel

Aaron Toponce — Wed, 25 May 2016 02:14:23 +0000

Normally, I keep a sharp eye on all things cryptographic-related with the Linux kernel. However, in 4.2, I missed something fantastic: jitterentropy_rng.ko. This is a Linux kernel module that measures the jitter of the high resolution timing available in modern CPUs, and uses this jitter as a source of true randomness. In fact, using the CPU timer as a source of true randomness isn't anything new. If you've read my blog for some time, you're already familiar with haveged(8). This daemon also collects CPU jitter and feeds the collected data into the kernel's random number generator.

The main site is at http://www.chronox.de/jent.html and the PDF paper describing the CPU jitter entropy can be found on the site.

So why the blog post about jitterentropy_rng.ko? Because now that we have something in the mainline kernel, we get a few benefits:

More eyes are looking at the code, and can adjust, analyze, and refine the entropy gathering process, making sure it's neither too aggressive nor too conservative in its approach.

We now have something that can collect entropy much earlier in the boot sequence, even before the random number generator has been initialized. This means we can have a properly seeded CSPRNG when the CSPRNG is initialized.

While not available now, we could have a kernelspace daemon collecting entropy and feeding it to the CSPRNG without the need for extra software.

This isn't just for servers, desktops, and VMs, but anything that runs the Linux kernel on a modern CPU, including Android phones, embedded devices, and SoC.

While haveged(8) has been a good solution for a long time, it has been heavily criticized, and it seems development on it has stalled. Here is another software solution for true randomness without the need of potentially dangerous 3rd party USB random number generators.

You don't need Intel's RDRAND. Any modern CPU with a high resolution timer will work. AMD, SPARC, ARM, MIPS, PA-RISC, Power, etc.

As mentioned in the list, unfortunately, loading the kernel module doesn't automatically top off the entropy estimate of the internal state of the CSPRNG (/proc/sys/kernel/random/entropy_avail). As such, /dev/random will still block when the estimate is low or exhausted. So you'll still need to run a userspace daemon to prevent this behavior. The author has also shipped a clean, light userspace daemon that just reads the data provided by the jitterentropy_rng.ko kernel module, and uses ioctl(2) to increase the estimate. The jitterentropy_rng.ko module provides about 10 KBps of random data.
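For the curious, here is a rough sketch of what "uses ioctl(2) to increase the estimate" can look like from userspace. I'm assuming the RNDADDENTROPY request from linux/random.h, and the numeric value below is what the macro expands to on x86 and ARM as far as I know; other architectures may differ. The function names are hypothetical, and this is not the daemon's actual code.

```python
import fcntl
import struct

# Assumed value of RNDADDENTROPY: _IOW('R', 0x03, int[2]) on x86 and ARM.
RNDADDENTROPY = 0x40085203


def pack_rand_pool_info(data, entropy_bits):
    """Build a struct rand_pool_info: entropy_count, buf_size, then the bytes."""
    return struct.pack("ii", entropy_bits, len(data)) + data


def read_entropy_avail(path="/proc/sys/kernel/random/entropy_avail"):
    """Return the kernel's current entropy estimate, in bits."""
    with open(path) as f:
        return int(f.read().strip())


def credit_entropy(data, entropy_bits, device="/dev/random"):
    """Mix data into the pool and raise the estimate (needs root)."""
    with open(device, "wb") as rnd:
        fcntl.ioctl(rnd, RNDADDENTROPY, pack_rand_pool_info(data, entropy_bits))
```

Only credit_entropy() needs root. Simply writing bytes to /dev/random mixes them into the pool without raising the estimate, which is why a daemon making this ioctl call is needed at all.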

Again, this isn't anything that something like haveged(8) doesn't already have access to. However, by taking advantage of a loaded kernel module, we can ensure that randomness is being collected before the CSPRNG is initialized. So, when CSPRNG initialization happens, we can ensure that it is properly seeded on first boot, minimizing the likelihood that identical keys will be created on distinct systems. This is something haveged(8) can't provide, as it runs entirely in userspace.

Unfortunately, jitterentropy-rngd(8) isn't available in the Debian repositories yet, so you'll need to download the compressed tarball from the author's website, then compile and install it yourself. However, he does ship a systemd(8) service file, which makes it easy to get the daemon up and running on boot with minimal effort.

I've had the jitterentropy_rng.ko module installed with the jitterentropy-rngd(8) userspace daemon running all day today, without haveged(8), and needless to say, I'm pleased. It keeps the CSPRNG entropy estimate sufficiently topped off for software that still relies on /dev/random (please stop doing this, developers; start using /dev/urandom) and provides adequate performance. Near as I can tell, there is not a character device created when loading the kernel module, so you can't access the unbiased data before feeding it into the CSPRNG. As such, I don't have a way to test its randomness quality. Supposedly, there is a way to access this via debugfs, but I haven't found it.

Anyway, I would recommend using jitterentropy_rng.ko and jitterentropy-rngd(8) over haveged(8) as the source for your randomness.

Weechat Relay With Let's Encrypt Certificates

Aaron Toponce — Fri, 20 May 2016 15:40:26 +0000

I've been on IRC for a long time. Not as long as some, granted, but likely longer than most. I've had my hand in a number of IRC clients, mostly terminal-based. Yup, I was (briefly) using the ircII client, then (also briefly) BitchX. Then I found irssi, and stuck with that for a long time. Search irssi help topics on this blog, and you'll see just how long. Then, after getting hired at XMission in January 2012, I switched full-time to WeeChat. I haven't looked back. This IRC client is amazing.

One of the outstanding features of WeeChat is the relay, effectively turning your IRC client into a bouncer. This feature isn't unique- it's in irssi also. However, the irssi proxy does not support SSL (2009). The WeeChat relay does. And with Let's Encrypt certificates freely available, this is the perfect opportunity to use TLS with a trusted certificate.

This post assumes that you are running WeeChat on a box whose firewall you control. In my case, I run WeeChat on an externally available SSH server behind tmux. With Let's Encrypt certificates, you will need to provide a FQDN for your Common Name (CN). This is all part of the standard certificate verification procedure. I purchased a domain that points to the IP of that server, and you will need to do the same.

The official Let's Encrypt "certbot" package used for creating Let's Encrypt certificates is already available in Debian unstable. A simple "apt install certbot" will get that up and running for you. Once installed, you will need to create your certificate.

$ certbot certonly --standalone -d weechat.example.com -m aaron.toponce@gmail.com

Per Let's Encrypt documentation, you need ports 80 and 443 open to the world when creating and renewing your certificate. The execution will create four files:

# ls -l /etc/letsencrypt/
total 24
drwx------ 3 root root 4096 May 19 12:36 accounts/
drwx------ 3 root root 4096 May 19 12:39 archive/
drwxr-xr-x 2 root root 4096 May 19 12:39 csr/
drwx------ 2 root root 4096 May 19 12:39 keys/
drwx------ 3 root root 4096 May 19 12:39 live/
drwxr-xr-x 2 root root 4096 May 19 12:39 renewal/
# ls -l /etc/letsencrypt/live/weechat.example.com/
total 0
lrwxrwxrwx 1 root root 43 May 19 12:39 cert.pem -> ../../archive/weechat.example.com/cert1.pem
lrwxrwxrwx 1 root root 44 May 19 12:39 chain.pem -> ../../archive/weechat.example.com/chain1.pem
lrwxrwxrwx 1 root root 48 May 19 12:39 fullchain.pem -> ../../archive/weechat.example.com/fullchain1.pem
lrwxrwxrwx 1 root root 46 May 19 12:39 privkey.pem -> ../../archive/weechat.example.com/privkey1.pem

The "cert.pem" file is your public certificate for your CN. The "chain.pem" file is the Let's Encrypt intermediate certificate. The "fullchain.pem" file is the "cert.pem" and "chain.pem" files combined. Of course, the "privkey.pem" file is your private key. For the WeeChat relay, it needs the "privkey.pem" and "fullchain.pem" files combined into a single file.

Because the necessary directories under "/etc/letsencrypt/" are accessible only by the root user, you will need root access to copy the certificates out and make them available to WeeChat, which hopefully isn't running as root. Also, Let's Encrypt certificates need to be renewed no sooner than every 60 days and no later than every 90 days. So, not only will you want to automate renewing the certificate, but you'll probably want to automate moving it into the right directory when the renewal is complete.

As you can see from above, I set up my certificate on a Thursday at 12:39. So weekly, on Thursday, at 12:39, I'll check to see if the certificate needs to be renewed. Because it won't renew any more frequently than every 60 days, but I have to have it renewed every 90 days, this gives me a 30-day window in which to get the certificate updated. So, I'll keep checking weekly. If a renewal isn't needed, the certbot(1) tool will gracefully exit. If a renewal is needed, the tool will update the certificate. Unfortunately, certbot(1) does not provide a useful exit code when renewals aren't needed, so rather than parsing text, I'll just copy the new certs into my WeeChat directory, regardless if they get updated or not.
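To make the window concrete, here is a small sketch of the date arithmetic. It assumes the certificate was issued on 19 May 2016 (the listing above doesn't show the year) and uses the 60-day and 90-day figures from this post. The function name is hypothetical.

```python
from datetime import date, timedelta


def weekly_checks_in_window(issued, earliest_days=60, expiry_days=90):
    """Return (opens, closes, checks) for weekly checks on the issue weekday."""
    opens = issued + timedelta(days=earliest_days)
    closes = issued + timedelta(days=expiry_days)
    checks = []
    day = issued + timedelta(weeks=1)
    while day <= closes:
        if day >= opens:
            checks.append(day)
        day += timedelta(weeks=1)
    return opens, closes, checks


if __name__ == "__main__":
    opens, closes, checks = weekly_checks_in_window(date(2016, 5, 19))
    print("window:", opens, "to", closes)
    for d in checks:
        print("check:", d)
```

Under those assumptions the window runs from 18 July to 17 August, and four of the weekly Thursday checks land inside it, so a single failed renewal attempt still leaves room to retry.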

So, in my root's crontab, I have the following:

39 12 * * 4 /usr/local/sbin/renew.sh

Where the contents of "/usr/local/sbin/renew.sh" are:

#!/bin/bash
certbot renew -q
cat /etc/letsencrypt/live/weechat.example.com/privkey.pem \
    /etc/letsencrypt/live/weechat.example.com/fullchain.pem > \
    ~aaron/.weechat/ssl/relay.pem
chown aaron.aaron ~aaron/.weechat/ssl/relay.pem

Now the only thing left to do is set up the relay itself in WeeChat. So, from within the client:

/relay sslcertkey
/relay add ssl.weechat 8443

You will need port 8443 open in your firewall, of course.

That's it. I have had some problems with certificate caching in WeechatAndroid it seems. So far, I have had to manually restart the relay in WeeChat, and flush the cache in WeechatAndroid and restart it to get the new certificate (I was previously using a self-signed certificate). Hopefully, this can also be automated, so I don't have to manually keep restarting the relay in WeeChat and flushing the cache in WeechatAndroid.

Regardless, this is how you use Let's Encrypt certificates with WeeChat SSL relay. Hopefully this is beneficial to someone.

Say Allo To Insecurity

Aaron Toponce — Thu, 19 May 2016 12:45:24 +0000

Yesterday, Google announced two new encrypted messaging apps called "Allo" and "Duo". There has been some talk about the security of Allo's end-to-end encryption and incognito mode. Most of it was speculation, until Thai Duong blogged about it. Well, it's time to see what he said, and see if Allo stands up to scrutiny.

"Allo offers two chat modes: normal and incognito. Normal is the default, but incognito can be activated with one touch. I want to stress that both modes encrypt chat messages when they are in transit or at rest. The Allo clients talk to Google servers using QUIC or TLS 1.2. When messages are temporarily stored on our servers waiting for delivery they are also encrypted, and will be deleted as soon as they're delivered."

There are a few things in this paragraph that need some explanation. First, "both modes encrypt chat messages when they are in transit or at rest". This is good, but the devil is in the details. In transit, Thai explains how they're encrypted: "Allo clients talk to Google servers using QUIC or TLS 1.2". This has a couple of ramifications. First, this isn't end-to-end encryption (E2E). This is client-server encryption, which means both the server and the client are encrypting and decrypting the data. As a result, any Google employee with the appropriate privileges can read the messages delivered to Google servers. That's sort of the point of why E2E encryption exists- to prevent this from happening.

Second, kudos for storing the messages encrypted on disk. But, realize that Google has the master key to decrypt these messages if needed. Also, kudos for deleting them off of Google's servers as soon as they're delivered. However, just like VPN service providers promising they don't log your connections, Google promising not to log your messages sort of falls into this category. That is, although they might not be storing the messages right now, they may store them later, especially if presented with a warrant from law enforcement. So, Google promising to not store your messages really doesn't amount to much other than maybe they don't want to unnecessarily chew through disk unless forced. Just remember, Google isn't going to go to jail for you, so they will comply with law enforcement.

"In normal mode, an artificial intelligence run by Google (but no humans including the Allo team or anyone at Google) can read your messages. This AI will use machine learning to analyze your messages, understand what you want to do, and give you timely and useful suggestions. For example, if you want to have dinner, it'll recommend restaurants or book tables. If you want to watch movies, it can buy you tickets.

Like it or not, this AI will be super useful. It's like having a personal assistant that can run a lot of errands for you right in your pocket. Of course, to help it help you you'll have to entrust it with your chat messages. I really think that this is fine, because your chat messages are used to help you and you only, and contrary to popular beliefs Google never sells your personal information to anyone."

Herein lies the real reason why E2E is not enabled by default- Google would like to mine your messages, on your phone, and present you with real-time ads. Ads not just on your phone, but likely when you're logged into your Google account on your desktop or laptop as well. If the data is E2E encrypted, this poses a problem for a company that has made Big Bucks off of advertising. With incognito mode, you are enabling E2E encryption, and the AI no longer has access to this data. Application and browser ads become generic, or must get their mining elsewhere. Because Google is allowing an AI to mine Allo messages for targeted ads, could it be possible that this same AI could be mining other data on your phone for the same goal? Could this AI be mining your email, Twitter, Facebook, photos, and other data? Will this AI be shipping solely with Allo, or will it be a separate service in Android N?

While Google might not be selling your data, they are making a percentage of sales that come from ads. The more targeted the ads become, the more likely you are to make a purchase, and the more likely Google will be to get a percentage of that sale. Google isn't selling your data, but they are making money off of it.

"But what if I want to stay off the grid? What if I don't want even the AI or whatever to see my messages?

"That's fine. We understand your concerns. Everybody including me has something to hide. This is why we develop the incognito mode. In this mode, all messages are further encrypted using the Signal protocol, a state of the art end-to-end chat encryption protocol which ensures that only you and your recipients can read your messages."

WhatsApp, acquired by Facebook, and pushing nearly one billion active messaging accounts, recently enabled E2E encryption also with the Signal Protocol. The difference being, with WhatsApp, E2E is default for every account when they update their app. E2E is not default for Allo, and only enabled for incognito mode. So, if "everybody including me has something to hide", then why isn't E2E default with Allo?

Thai then quotes a survey explaining that users want self-destructing messages more than E2E. He explains that survey with (emphasis mine):

"So to most users what matters the most is not whether the NSA can read their messages, but the physical security of their devices, blocking unwanted people, and being able to delete messages already sent to other people. In other words, their threat model doesn't include the NSA, but their spouses, their kids, their friends, i.e., people around and near them. Of course it's very likely that users don't care because they don't know what the NSA has been up to. If people know that the NSA is collecting their dick pics, they probably want to block them too. At any rate, NSA is just one of the threat sources that can harm normal users."

Sure, my threat model is also losing my phone. I find that much more likely than either the NSA confiscating my phone, issuing a warrant to collect my data, or decrypting my traffic in real-time (which isn't practical anyway). However, while the NSA isn't in my threat model, the NSA should be in Google's threat model. In other words, Google should be worrying about the NSA for me.

This is why I created d-note, which is running at https://secrets.xmission.com. As a system administrator, I don't want to turn over logs to the NSA or any other organization. As such, the messages are encrypted server-side before stored to disk, and destroyed immediately upon viewing. The goal isn't necessarily to protect the end user, but to protect the server administrator. Legitimately not being able to provide logs or data when a warrant is issued is extremely valuable.

Google should be protecting the "dick pics" of users from getting into the NSA hands. Apple recently made a strong stand here against the FBI regarding Syed Farook's iPhone. Apple technically could not help the FBI, because of the protections that Apple baked into their product. Apple's hands were tied. As such, the FBI wanted to set a precedent about enabling government backdoors into the OS for future releases, so they would no longer be blocked from access. Apple is protecting the "dick pics" of its users from the NSA, FBI, and everyone else. Why isn't Google? As we mentioned earlier, the answer to that question is data mining and advertising revenue.

"This is why I think end-to-end encryption is not an end in itself, but rather a means to a real end which is disappearing messaging. End-to-end encryption without disappearing messaging doesn't cover all the risks a normal user could face, but disappearing messaging without end-to-end encryption is an illusion. Users need both to have privacy in a way that matters to them."

Emphases mine. So, Thai recognizes that disappearing messaging without E2E encryption is an illusion. So, why isn't it default? The higher powers that be, likely. He mentions in his conclusion that he would like E2E to be default, with a single tap. Something of an option with "Always start in incognito", thus always starting with E2E and always having self-destructing messages. However, rather than opt-in, it should be opt-out. If the prior message history is more important to you than the security of E2E encryption and self-destructing messages, then it should be something that you switch. If SnapChat is so popular because of self-destructing messages, and WhatsApp has one billion users with E2E encryption by default, Google, a company larger than both combined, should be able to do the same.

Finally, one point that Thai does not mention in his post. Allo is proprietary closed-source software. From a security perspective, this is problematic. First, because you don't have access to the source, you cannot audit it to make sure it holds up to the security claims that it has. As security and software engineers, not having access to the source code should be a major block when considering the use of non-free software.

Second, without access to the source code, you cannot create reproducible builds. Even if you did have access to the source code, are you sure the binary you have installed matches the binary you can build? If not, how do you know the binary isn't spying on you? Or compromised? Or just compiled incorrectly, causing undesired behavior? Not being able to create reproducible builds of software means not being able to verify the integrity of the shipped binary. Debian is making it a high priority to ship packages with reproducible builds. It's important to Debian, because they want to be transparent with their userbase. If you don't trust Debian is doing what they claim, you can rebuild the binaries and packages yourself, and they should match what Debian has shipped.

I know this sounds very Richard Stallman and GNU, but proprietary closed-source software is scary when it comes to security. While your immediate threat model might just be those you interact with on a daily basis, the immediate threat model to Google, Apple, SnapChat, and others, are well-funded organizations that have legal weight. Ultimately, they're after your data, which in the end, puts them in your threat model. There are no safety or security guarantees with proprietary closed-source software. You are at the mercy of the software vendor to Do The Right Thing, and many companies just don't.

So, while Allo might be the new kid on the block with E2E encrypted and self-destructing messages, as I've shown, it can't be trusted for your security and privacy. You're best off ignoring it, not recommending it to family and friends, and sticking with Free Software alternatives where E2E messages are default.

How To Always Encrypt Chromium Saved Passwords On GNU/Linux - No Matter What

Aaron Toponce — Sun, 01 May 2016 21:47:03 +0000

One of the things that has always bothered me about the Chromium project (the project the Google Chrome browser is based on) is that passwords are encrypted, if and only if your operating system provides an authentication API through your account login. For example, on Windows, this is accomplished through the "CryptProtectData" function. This function uses your existing account credentials when logging into your computer, as a "master key" to encrypt the passwords on your hard drive. For Mac OS X, this is accomplished with Keychain, and with GNU/Linux users, KWallet if you're running KDE or GNOME Keyring if you're running GNOME.

In all those cases, your saved passwords will be encrypted before getting saved to disk. But, what if you're like me, and do not fall into any of those situations? Now, granted, GNU/Linux and BSD users (you're welcome) make up about 3% of the desktop installs.

Of that 3%, although I don't have any numbers, maybe 2/3 run GNOME or KDE. That leaves 1 out of every 100 users where Chromium is not encrypting passwords on disk by default. For me, who lands in that 1%, this is unacceptable. So, I wanted a solution.

Before I go any further, let me identify the threat and adversary. The threat is offline disk analysis. I'm going to assume that you're keeping your operating system up-to-date with the latest security patches, and that your machine is not infected with malware. Instead, I'm going to assume that after you are finished using your machine, upgrading the hardware, or a hard drive fails, that the disk is discarded. I'm further going to assume that you either can't or didn't digitally wipe or physically destroy the drive once decommissioned. So, the threat is someone getting a hold of that drive, or laptop, or computer, and imaging the drive for analysis. This means that our adversary is a global adversary- it could be anyone.

Now, the obvious solution would be to run an encrypted filesystem on that drive. dm-crypt with or without LUKS makes this possible. But, let's assume you're not running FDE. Any options? In my case, I run eCryptfs, and store the Chromium data there, symbolically linking to it from the default location.

By default, Chromium stores its passwords in ~/.config/chromium/Default/Login\ Data. This is an SQLite 3.x database, and as mentioned, the passwords are stored in plaintext. A simple solution is to create an eCryptfs private directory, and symlink the database to that location. However, Chromium also stores cookies, caches, and other data in ~/.config/chromium/ that might be worth encrypting as well. So, you can just symlink the entire ~/.config/chromium/ directory to the eCryptfs mount.
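If you want to see for yourself what sits in that database, a few lines of Python will do it. This is a sketch only: the "logins" table and its column names are my assumption about Chromium's schema and may differ between versions, and list_saved_logins is a hypothetical helper. Chromium keeps the database locked while it is running, so point this at a copy of the file.

```python
import sqlite3


def list_saved_logins(db_path):
    """Return (origin_url, username_value, password_value) rows from the DB."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT origin_url, username_value, password_value FROM logins"
        )
        return cur.fetchall()
    finally:
        conn.close()
```

On a system without KWallet or GNOME Keyring, the password_value column is where the plaintext shows up, which is exactly the data an offline attacker imaging the disk would be after.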

I'll assume you've already set up eCryptfs and have it mounted to ~/Private/. If not, run the "ecryptfs-setup-private" command, and follow the prompts, then run "ecryptfs-mount-private" to get it mounted to ~/Private/.

Make sure Chromium is not running and move the ~/.config/chromium/ directory to ~/Private/. Then create the necessary symlink, so Chromium does not create a new profile:

$ mv ~/.config/chromium/ ~/Private/
$ ln -s ~/Private/chromium/ ~/.config/

At this point, all your Chromium data is now stored in your eCryptfs encrypted filesystem, and Chromium will follow the symlink, reading and writing passwords in the encrypted mount. This means, no matter if using KWallet or GNOME Keyring, or nothing at all, your passwords will always be encrypted on disk. Of course, in the SQLite 3.x database, the passwords are still in plaintext, but the database file is encrypted in eCryptfs, thus giving us the security that we're looking for.

However, there is a caveat which needs to be mentioned. The entire security of the encryption rests solely on the entropy of your eCryptfs passphrase. If that passphrase does not have sufficient entropy to withstand a sophisticated attack from a well-funded organization (our global adversary), then all bets are off. Essentially, this eCryptfs solution is acting like a "master password", and all encryption strength rests on your ability to use a strong password defined by Shannon entropy. Current best-practice to guard against an offline password cracking attack is to pick a password with at least 128 bits of entropy. You can use zxcvbn.js from Dropbox to estimate your passphrase entropy, which I have installed at http://ae7.st/ent/ (no, I'm not logging passphrases- save the page offline, pull your network cable and run it locally if you don't believe me).
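To get a feel for what 128 bits means in practice, here is a quick back-of-the-envelope sketch. It only holds for passphrases chosen uniformly at random from a fixed alphabet or word list, which is not how zxcvbn estimates human-chosen passwords. The function names and the example alphabet sizes are my own.

```python
import math


def entropy_bits(alphabet_size, length):
    """Entropy of a uniformly random string: length * log2(alphabet_size)."""
    return length * math.log2(alphabet_size)


def min_length(alphabet_size, target_bits=128):
    """Smallest length whose entropy reaches target_bits."""
    return math.ceil(target_bits / math.log2(alphabet_size))


if __name__ == "__main__":
    for name, size in [("lowercase letters", 26),
                       ("printable ASCII", 94),
                       ("7776-word list", 7776)]:
        print(name, min_length(size))
```

Under those assumptions you need 28 random lowercase letters, 20 random printable ASCII characters, or 10 random words from a 7776-word list to reach 128 bits.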

Opera, VPNs, and Security

Aaron Toponce — Fri, 22 Apr 2016 13:30:43 +0000

Yesterday, Opera announced that they are bundling a VPN with the latest release of their browser. This is what the release says:

Why we are adding free VPN in Opera

Bringing this important privacy improvement marks another step in building a browser that matches up to people's expectations in 2016. When you think about it, many popular options offered by desktop browsers today were invented (quite frequently by Opera) many years ago. The innovation energy in the industry has been recently so focused on mobile, even if the desktop is still thriving.

In January, we were reviewing our product plans, and we realized that people need new features in order to browse the web efficiently in 2016. It also became apparent to us that what people need are not the same features that were relevant for their browsers ten years ago. This is why we today have more engineers than ever before working on new features for our desktop browser.

So far we have the native ad blocker. And, we're introducing another major feature in just a matter of a few weeks; a native, unlimited and free VPN client, right inside your browser!

Enhanced privacy online with Opera's free VPN

According to Global Web Index*, more than half a billion people (24% of the world's internet population) have tried or are currently using VPN services. According to the research, the primary reasons for people to use a VPN are:

- To access better entertainment content (38%)
- To keep anonymity while browsing (30%)
- To access restricted networks and sites in my country (28%)
- To access restricted sites at work (27%)
- To communicate with friends/family abroad (24%)
- To access restricted news websites in my country (22%)

According to the research, young people are leading the way when it comes to VPN usage, with almost one third of people between 16-34 having used a VPN.

Better than traditional VPNs

Until now, most VPN services and proxy servers have been limited and based on a paid subscription. With a free, unlimited, native VPN that just works out-of-the-box and doesn't require any subscription, Opera wants to make VPNs available to everyone.

That's why Opera's built-in free VPN feature is easy to use. To activate it, Mac users just need to click the Opera menu, select "Preferences" and toggle the feature VPN on, while Windows and Linux users need to go to the "Privacy and Security" section in "Settings" and enable VPN there. A button will appear in the browser address field, from which the user can see and change location (more locations will appear later), check whether their IP is exposed and review statistics for their data used. It's free and unlimited to use, yet it offers several must-have options available in paid VPNs, such as:

Hide your IP address - Opera will replace your IP address with a virtual IP address, so it's harder for sites to track your location and identify your computer. This means you can browse the web more privately.

Unblocking of firewalls and websites - Many countries, schools and workplaces block video-streaming sites, social networks and other services. By using a VPN you can access your favorite content, no matter where you are.

Public Wi-Fi security - When you're surfing the web on public Wi-Fi, intruders can easily sniff data. By using a VPN, you can improve the security of your personal information.

There were a couple things that stuck out to me rather quickly when reading this press release:

Is it a true VPN, or just an HTTP proxy?

If either a VPN or an HTTP proxy, how is it handling DNS requests?

If an HTTP proxy, is the request through a transparent TLS connection to Opera?

Why is the press release specifically absent about logs and tracking?

Well, some of these questions have been answered. First, it's not a true VPN. Instead, it's just an HTTP/HTTPS proxy. Here's the details:

How the "VPN" works

Once the user enables the feature in settings, Opera VPN sends API requests to https://api.surfeasy.com to obtain credentials and proxy IPs. The browser then talks to a proxy like de0.opera-proxy.net, and its IP address can only be resolved from within Opera when the VPN feature is turned on. It's an HTTP/S proxy that requires authentication.

When the Opera browser with enabled VPN loads a page, it sends many requests to de0.opera-proxy.net with a Proxy-Authorization request header.

The Proxy-Authorization header decoded:
CC68FE24C34B5B2414FB1DC116342EADA7D5C46B:9B9BE3FAE67
4A33D1820315F4CC94372926C8210B6AEC0B662EC7CAD611D86A3

Since we're talking about a proxy, these credentials can be used with de0.opera-proxy.net even when connecting from a different machine. This means that if you use the proxy on a computer with no Opera installed, you'll get the same IP as when using Opera's VPN.

From this, we can learn that it's not a VPN at all. In fact, it's not even deploying a TLS tunnel for the HTTP/S proxy. So, traditional HTTP requests will still be in the clear, just with a different target. So while a school or library might be filtering requests based on DNS, this HTTP/S proxy in Opera doesn't address more active smart filtering based on content.
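To illustrate why those proxy credentials are portable, here is a sketch of how a Proxy-Authorization value is built and decoded. I'm assuming the standard HTTP "Basic" scheme, which matches the user:password shape of the decoded header quoted above but isn't confirmed by it. The function names and the credentials in the usage note are made up.

```python
import base64


def decode_proxy_authorization(header_value):
    """Split a 'Basic <base64>' Proxy-Authorization value into (user, password)."""
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("unsupported scheme: %r" % scheme)
    decoded = base64.b64decode(encoded).decode("utf-8")
    user, _, password = decoded.partition(":")
    return user, password


def encode_proxy_authorization(user, password):
    """Build the header value any HTTP client could send to the proxy."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    return "Basic " + token.decode("ascii")
```

For example, encode_proxy_authorization("user", "pass") gives "Basic dXNlcjpwYXNz". Nothing in that value ties it to the browser that obtained it, so any client that sends the same header looks identical to the proxy.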

Unfortunately, Help Net Security also suggests the use of a general VPN service provider (emphasis mine):

"What Opera offers is not a VPN as such. It's just a proxy for the browser. You still need a full VPN if privacy is what you care about (and you should care about your privacy). Other tools you use, including for example email clients like Outlook, won't use this 'VPN'," Špaček told Help Net Security.

VPN service providers are scary. Sven Slootweg posted a "Don't use VPN services" Gist where he addresses some real concerns with using VPN service providers (I don't agree with a couple points):

VPN service providers log connections and other metadata.

VPN service providers have full accounting and payment information of their customers.

VPNs really are just glorified proxies, and don't provide any meaningful security or privacy.

VPNs don't obfuscate your IP address like Tor, and your IP address is meaningless to trackers anyway.

VPN service providers exist, because it's easy money.

I don't fully agree with a couple points (IP addresses are extremely valuable to trackers), but I think the overall points Sven is trying to drive home are the following: know how VPNs work, who has access to data at the VPN endpoint(s), and your security and privacy risks when using a VPN. There are valid times to use a VPN. Data is encrypted between your VPN client and the provider, so it is an easy way to get around restrictive firewalls, which you would think Opera would be trying to address with their HTTP/S proxy. You may also need to access your corporate internal network when "on the road", in which case using your corporate VPN server is needed.

But in both cases, understand the security and privacy concerns when using the VPN. Your VPN provider isn't going to go to jail for you. If the FBI catches unsavory traffic coming out of the VPN provider, you can rest assured they'll give the authorities all the logging, account, and payment information to comply with the request. You can rest assured that if your employer catches you breaking policy with the VPN, you will lose your VPN access, and possibly your job.

So, what to do? Well, realistically, if you want to obfuscate your traffic dynamically, securely, and pseudo-anonymously, then use Tor. Install a Tor client on your machine, install a Tor proxy extension in your browser, and when you want to get around restrictive firewalls, flip the proxy switch, and get on Tor.

Of course, Tor isn't a security and privacy panacea. You still need to understand the risks associated with using Tor. For example, the extension you installed may not tunnel DNS requests through Tor. Of course, HTTP traffic is still in the clear when it leaves the Tor exit relay. Tor clients and extensions may contain vulnerabilities that reveal metadata about you. Basically, don't be ignorant or stupid with your Tor connection.

Regardless, I think we can take a few things away from this post:

The Opera VPN is just an HTTP/S proxy.

Opera is very likely logging all your traffic.

Your Opera VPN browsing habits are likely unique enough for Opera to identify you.

VPN service providers should be avoided.

VPN service providers also are likely logging all your traffic.

Your VPN service provider won't go to jail for you.

When in doubt, use Tor, just understand the risks.

Tor and the CloudFlare Problem

Aaron Toponce — Sun, 17 Apr 2016 19:29:32 +0000

Before I go anywhere with this post, let me make three things very clear:

I do not work for CloudFlare.

I work for a small local ISP in Utah.

I have been using Tor probably almost as long as many of you have been alive.

I first blogged about Tor in 2006. I had discovered it around 2004, only a couple of years after its first release. I had used it as a way to prevent my ISP and my employer from tracking what I'm doing with my Internet connection. I would set up a simple SOCKS proxy in Firefox, then switch to it when I wanted to get on the Tor network, and switch away from it when I didn't. Oh, and you think latencies are bad on Tor now? You should have been on it back then.

Here is a metrics graph showing the time it took to download a 50 KiB file over the Tor network. Unfortunately, they don't have the data back when I started using the network, but you get a rough idea of what it was like:

This makes a good deal of sense, because back then, ISPs didn't provide a lot of bandwidth to customers (it can be argued they still don't), and there weren't a lot of exit nodes in the Tor network to handle the bandwidth (again, it can be argued there still aren't enough):

Spend some time on metrics.torproject.org looking over the historical data, and you'll get a good sense that using Tor in 2004 was a lot like getting data over dial-up. It was anything but pleasant.

What's the point? The point is that while things can still be improved (we need more exit nodes, and we need more bandwidth on each exit node), the Tor network's latencies, bandwidth, and relays are in a good position compared to 12 years ago when I started using it. So running large-scale attacks through the network is now practical.

So, where does CloudFlare fit into this? CloudFlare deploys solving captchas when you wish to consume a service behind the CloudFlare CDN. For example, while connected to Tor, visit medium.com, and you will be presented with a captcha, similar to something like this:

This has gotten a lot of criticism from the "cypherpunk" millennials who feel that Tor access should be unrestricted. If you follow the "#dontblocktor" hashtag on Twitter, you will see the continued repeated criticism of CloudFlare deploying these captchas to Tor users on their CDN. Some of the arguments include:

Solving the captcha may only bring up another, repeatedly, never being able to consume the website in question.

Visually impaired users cannot solve the visual captchas.

Non-native English speakers will not be able to solve the audio version of the captcha.

People using browsers that disable JavaScript will not be able to reach the page.

There may be other security concerns where the choice of Tor is preferred over not using Tor.

No doubt, all captchas on the Web should be reconsidered. Personally, for JavaScript-enabled browsers, I think forcing a proof-of-work puzzle onto the browser is transparent, and provides exactly the sort of rate-limiting needed for mitigating large-scale malicious attacks. For browsers without JavaScript, captchas seem to be the best alternative. But I'm sure that, as a society, we can find alternatives for non-JavaScript browsers (such as network-based proof-of-work puzzles).
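To make the proof-of-work idea concrete, here is a minimal hashcash-style sketch. It is a generic client puzzle and not CloudFlare's implementation; the function names, the 8-byte nonce encoding, and the choice of SHA-256 are assumptions made for illustration. The client burns CPU finding a nonce, while the server verifies it with a single hash.

```python
import hashlib
import itertools


def leading_zero_bits(digest):
    """Count the zero bits at the front of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def puzzle_digest(challenge, nonce):
    """Hash the server's challenge together with the client's nonce."""
    return hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()


def solve(challenge, difficulty):
    """Client side: try nonces until the digest has enough leading zero bits."""
    for nonce in itertools.count():
        if leading_zero_bits(puzzle_digest(challenge, nonce)) >= difficulty:
            return nonce


def verify(challenge, nonce, difficulty):
    """Server side: a single hash checks the client's work."""
    return leading_zero_bits(puzzle_digest(challenge, nonce)) >= difficulty


# Hypothetical challenge; 12 bits costs about 4096 hashes on average.
challenge = b"example-challenge"
nonce = solve(challenge, 12)
print(nonce, verify(challenge, nonce, 12))
```

Each extra bit of difficulty doubles the expected client work, which is what makes the puzzle usable as a rate limit.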

No doubt physical limitations, such as visual or audible impairments, can make solving a visual and audible captcha challenging, if not impossible. I don't have good solutions here except for JavaScript-based proof-of-work puzzles. But the real question that needs to be addressed is: why is CloudFlare deploying captchas for Tor users?

CloudFlare addressed this due to the ongoing criticism a select few on Twitter have been giving the company. The blog post "The Trouble with Tor" basically comes down to the following:

You must pick two between: security, anonymity, and convenience.

CloudFlare is a large CDN that deals regularly with malicious traffic sourced from Tor exit relays.

Captchas are a compromise, allowing Tor users to remain anonymous, while also getting access to the website.

A CloudFlare CDN customer has an option in their control panel to whitelist Tor or captcha Tor.

CloudFlare is investigating "blind token" proof-of-work client puzzles for something long-term.

I don't see anything unreasonable here. As a system administrator and security engineer for XMission, I understand and sympathize with CloudFlare's stance toward captchas, even if I don't agree with the implementation of the captcha itself. I have had to fight off malicious Tor traffic from our network many times during my employment, such as DNS and NTP amplification attacks, HTTP POST DDoS attacks, SQL injection and XSS attacks, and many others.

So, even as CloudFlare put it in their reasonable post, how do you allow honest Tor users with high degrees of convenience to consume the website while also minimizing and proactively mitigating malicious Tor traffic?

Again, I don't care for captchas, and wish they would die in a fire. But, what should CloudFlare do? Should they abandon the captcha altogether? If so, how should they proactively prevent malicious Tor traffic from negatively impacting their customer base? It's easy and knee-jerky to post screenshots to Twitter with the "#dontblocktor" hashtag, and shame CloudFlare and the customer using the CDN. I don't think that's the right approach, personally (nevermind that a captcha isn't a block (yes, semantics are important)). I'm curious how many of those who are reacting to CloudFlare captchas are actual system or network administrators that have to deal with these attacks. Instead, I would try to architect solutions to the problem.

Personally, I see the following:

Consume CloudFlare without Tor. There are no captchas, but you sacrifice a level of anonymity.

Consume CloudFlare behind Tor, but understand that you are sacrificing convenience by having to solve captchas.

Consume CloudFlare behind a VPN, thus providing both anonymity and convenience.

If it really bothers you that you have to solve a captcha to reach a CloudFlare website, then rather than shaming CloudFlare, it might be worth your time to reach out to the site operator, and let them know about whitelisting Tor. If they engage in conversation, they may not have been aware of the configuration option, or they may have reasons why they want you to solve the captcha. Either way, you've come out ahead without the knee-jerking of #dontblocktor.

I guess in conclusion, while I hate captchas as much as the next guy, what would you do if you were employed by CloudFlare and in charge of this problem? What is a reasonable solution to keeping customers happy by mitigating malicious Tor traffic while also allowing honest Tor users to consume the website with high levels of convenience? Let's engage in discussion about how to create and architect these solutions, so we get as many people happy as possible: CloudFlare network admins, customers, and clients.

A final note about the term "block". The CloudFlare captcha is not blocking you from reading the website. Instead, it's rate-limiting you. Some will argue that you get caught in endless captcha loops, consistently solving them over and over, never to actually reach the service. Personally, I have never encountered this, but others swear it exists. At most, I've had to solve 3 captchas in a row, usually because I did not solve them quickly enough. I guess the effect is the same, but as already mentioned, the "#dontblocktor" hashtag is a knee-jerk, and incorrectly placed. Semantics are important, because CloudFlare is not actually blocking Tor, like Akamai does with "Access denied". It's one thing to provide a 502 HTTP error, it's quite another to rate limit requests.

Two OCB Block Cipher Mode Patents Expired Due To Nonpayment

Aaron Toponce — Fri, 01 Apr 2016 04:21:50 +0000

Peter Gutmann on the "[Cryptography]" mailing list wrote some thoughts about the impending crypto monoculture of all-things-Bernstein that seems to be currently sweeping the crypto world. In his post, he mentions the following (emphasis mine):

The remaining mode is OCB, which I'd consider the best AEAD mode out there (it shares CBC's graceful-degradation property in which reuse or misuse of the IV doesn't lead to a total loss of security, only the authentication property breaks but not the confidentiality). Unfortunately it's patented, and even though there are fairly broad exceptions allowing it to be used in many situations, the legal minefield that ensues makes it untouchable for most potential users. For example does the prohibition on military use cover the situation where an open-source crypto package is used in a vendor library that's used in a medical insurance app that's used by the US Navy, or where banking transactions protected by TLS may include ones of a military nature (both of these are actual examples that affected decisions not to use OCB). Since no-one wants to call in lawyers every time a situation like this comes up, and indeed can't call in lawyers when the crypto is several levels away in the service stack, OCB won't be used even though it may be the best AEAD mode out there.

Dr. Matthew Green also wrote about authenticated encryption and block cipher modes. He had this to say about OCB mode (emphasis mine):

In performance terms Offset Codebook Mode blows the pants off of all the other modes I mention in this post. It's 'on-line' and doesn't require any real understanding of Galois fields to implement** -- you can implement the whole thing with a block cipher, some bit manipulation and XOR. If OCB was your kid, he'd play three sports and be on his way to Harvard. You'd brag about him to all your friends.

I've known that OCB mode was patented, and that as a result, it has not been included in OpenSSL and other cryptographic protocol implementations. Peter said it correctly: it is a legal minefield. However, I wanted to read up on the patents, their design, operation, etc., mostly because I wanted to get out of doing the dishes. Imagine my shock when I stumbled upon the following:

Patent 7,046,802 - Method and apparatus for facilitating efficient authenticated encryption

Status: Lapsed

Patent 7,200,227 - Method and apparatus for facilitating efficient authenticated encryption

Status: Lapsed

Not fully understanding what "Lapsed" means, I went to the official source: The United States Patent and Trademark Office website. I searched for those two patent numbers, and got the following:

Patent 7,046,802 - Method and apparatus for facilitating efficient authenticated encryption

Status: Patent Expired Due to NonPayment of Maintenance Fees Under 37 CFR 1.362

Status Date: 06-06-2014

Patent 7,200,227 - Method and apparatus for facilitating efficient authenticated encryption

Status: Patent Expired Due to NonPayment of Maintenance Fees Under 37 CFR 1.362

Status Date: 05-04-2015

Sure enough, Phillip Rogaway's first two patents regarding the OCB block cipher mode of encryption are expired due to nonpayment. I had to tweet this:

According to the USPTO, patents 7046802 and 7200227 regarding OCB block cipher encryption mode by Phillip Rogaway expired due to nonpayment.

— Aaron Toponce (@AaronToponce) April 1, 2016

Patents 7949129 (Method and apparatus for facilitating efficient authenticated encryption) and 8321675 (Method and apparatus for facilitating efficient authenticated encryption) are still valid, however. I'm not sure how this applies to Charanjit Jutla's IAPM mode patents now owned by IBM. Also, I don't know exactly what OCB modes patents 7,046,802 and 7,200,227 cover. OCB1 and OCB2? If someone can comment here, that would be great.

So, what does this mean for the cryptography world? It means that OCB covered by those two patents can now be implemented royalty-free, without fear of legal entanglements, in Free Software as well as proprietary and commercial software. OpenSSL, LibreSSL, BoringSSL, OpenPGP, Open Whisper Systems Signal, and so many other protocols, projects, and software should be able to implement OCB now.

All because Phillip Rogaway did not make the payments necessary to keep the patent valid. Two more software patents bite the dust.

Linux Kernel CSPRNG Performance

Aaron Toponce — Wed, 09 Mar 2016 02:34:23 +0000

I'm hardly the first one to notice this, but I was having a discussion in ##crypto on Freenode about the Linux kernel CSPRNG performance. It was mentioned that the kernelspace CSPRNG was "horrendously slow". Personally, I found the performance sufficient for my needs, but I decided to entertain his definition. I'm glad I did; I wasn't disappointed.

Pull up a terminal, and run the following command, passing 1 GB of data from /dev/urandom to /dev/null:

$ dd if=/dev/urandom of=/dev/null bs=1M count=1024 iflag=fullblock
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 80.1537 s, 13.4 MB/s
$ pv < /dev/urandom > /dev/null # cancel in a different terminal, unless you have "-S"
1.02GB 0:01:20 [13.3MB/s] [ < => ]

13.4 MBps of throughput for reading data directly out of the kernelspace CSPRNG. But, can we do better?
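For readers who would rather measure from a script, here is a rough sketch of the same kind of test. It assumes that os.urandom() is backed by the same kernel CSPRNG (on Linux it typically goes through the getrandom() system call rather than opening /dev/urandom), so the numbers are comparable but not identical to the dd(1) run. The function name and the 16 MiB default are arbitrary choices.

```python
import os
import time


def urandom_throughput(total_bytes=16 * 1024 * 1024, chunk=1024 * 1024):
    """Read total_bytes from the kernel CSPRNG and return MB/s (10^6 bytes)."""
    remaining = total_bytes
    start = time.perf_counter()
    while remaining > 0:
        size = min(chunk, remaining)
        os.urandom(size)  # the data is discarded, like writing to /dev/null
        remaining -= size
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e6


print(f"{urandom_throughput():.1f} MB/s")
```

Results depend heavily on the kernel version and hardware, so treat any single run as a ballpark figure.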

In the ##crypto channel, and as should be across development mailing lists, forums, groups, and discussion channels, I recommend that developers should not generally develop their own userspace CSPRNG. There are all sorts of pitfalls and traps waiting for you when you attempt it. Unless you know what you're doing, you could end up with a CSPRNG that isn't actually cryptographically secure (the "CS" in "CSPRNG").

However, what happens when I do actually run a userspace CSPRNG on the same machine? What can I expect out of performance? For example, I could implement AES-128 in CTR mode as a CSPRNG. In fact, we can do this with OpenSSL:

$ dd if=/dev/zero bs=10M count=1024 iflag=fullblock 2> /dev/null | openssl enc -aes-128-ctr -pass pass:"sHgEOKTB8bo/52eDszkHow==" -nosalt | dd of=/dev/null
20971520+0 records in
20971520+0 records out
10737418240 bytes (11 GB) copied, 15.3137 s, 701 MB/s
$ openssl enc -aes-128-ctr -pass pass:"sHgEOKTB8bo/52eDszkHow==" -nosalt < /dev/zero | pv > /dev/null
31.9GB 0:00:34 [ 953MB/s] [ < => ]

700-950 MBps (notice that dd(1) incurs a performance penalty). That's 52-70x the speed of reading the kernelspace CSPRNG directly. That's more than a full order of magnitude faster. However, this is on a box with AES-NI. What about disabling AES-NI on the same box? How badly does it damage performance, and how does it compare to reading the kernelspace CSPRNG? We can use OpenSSL speed(1SSL) to benchmark algorithms.

First, with AES-NI enabled:

$ openssl speed -elapsed -evp aes-128-ctr 2> /dev/null
(...snip...)
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-ctr    468590.43k  1174849.02k  1873606.83k  2178642.60k  2244471.47k

And with AES-NI disabled:

$ OPENSSL_ia32cap="~0x200000200000000" openssl speed -elapsed -evp aes-128-ctr 2> /dev/null
(...snip...)
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-ctr     74272.21k    83315.43k   340393.30k   390135.47k   391279.96k

In this case, we see about a 5x performance improvement when using the AES-NI instruction set as compared to when not using it. That's significant. And even with AES-NI disabled in userspace, we're still outperforming /dev/urandom by almost 30x.
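The ratios quoted in this post can be rechecked from the measured figures. The sketch below only redoes the arithmetic on numbers already shown above; the variable names are made up for the example.

```python
# Figures taken from the benchmark output above.
kernel_mb_s = 13.4            # dd from /dev/urandom
openssl_dd_mb_s = 701.0       # openssl aes-128-ctr piped through dd
openssl_pv_mb_s = 953.0       # openssl aes-128-ctr piped through pv
aesni_kb_s = 2244471.47       # openssl speed, 8192-byte blocks, AES-NI on
no_aesni_kb_s = 391279.96     # openssl speed, 8192-byte blocks, AES-NI off

speedup_dd = openssl_dd_mb_s / kernel_mb_s
speedup_pv = openssl_pv_mb_s / kernel_mb_s
aesni_gain = aesni_kb_s / no_aesni_kb_s
no_aesni_vs_kernel = (no_aesni_kb_s / 1000.0) / kernel_mb_s

print(f"userspace vs kernel (dd): {speedup_dd:.1f}x")
print(f"userspace vs kernel (pv): {speedup_pv:.1f}x")
print(f"AES-NI on vs off:         {aesni_gain:.1f}x")
print(f"AES-NI off vs kernel:     {no_aesni_vs_kernel:.1f}x")
```

The results land at roughly 52x, 71x, 5.7x, and 29x, in line with the rounded figures in the text.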

Interestingly enough, even the OpenBSD CSPRNG (different hardware than previously tested), which uses ChaCha20, outperforms the Linux CSPRNG (although its userspace CSPRNG with openssl(1) doesn't outperform kernelspace):

% dd if=/dev/urandom of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 13.630 secs (78775541 bytes/sec)
% dd if=/dev/zero bs=1M count=1024 2> /dev/null | openssl enc -aes-128-ctr -pass pass:"sHgEOKTB8bo/52eDszkHow==" -nosalt | dd of=/dev/null
2097152+0 records in
2097152+0 records out
1073741824 bytes transferred in 33.498 secs (32052998 bytes/sec)
% openssl speed -elapsed -evp aes-128-ctr 2> /dev/null
(...snip...)
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-ctr     41766.37k    46930.74k    49593.54k    50669.32k    50678.33k

Roughly 78 MBps for OpenBSD on an Intel Xeon CPU running at 2.80GHz. Basically, six times the speed of the Linux kernel CSPRNG on an Intel Xeon CPU running at 2.67GHz.

So why is the Linux CSPRNG so slow? And, what can we do about it? Well, first, the kernel is using SHA-1 for its cryptographic primitive. In very loose terms, the CSPRNG hashes the input pool with SHA-1, and spits out the output to /dev/urandom. Its output is also its input, so it's digesting its own output.

But, that's not all it's doing actually. The first function actually adds data into the input pool without increasing the entropy estimate. Then, after adding those bytes, the input pool is mixed with a Skein-like mixing function. Then some math is done to credit the entropy estimator, and the system is polled for data to add to the input entropy pool. Things like disk IO, CPU timings, interrupts, and user activity. Finally, we're ready to hash the data. This is done by extracting the data out of the input pool, and hashing it with SHA-1. But, we don't want any recognizable output, so the output is left-rotated and folded in half. Then, and only then, is the data ready for consumption.
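The "hash, feed back, fold in half" part of that description can be sketched in a few lines. This is a toy model and not the kernel's random.c: the ToyPool class, the XOR-based feedback, and the 64-byte pool are stand-ins invented for illustration, and the rotation step and entropy accounting are left out.

```python
import hashlib


def fold_in_half(digest):
    """XOR the two halves of a digest together, halving its length."""
    half = len(digest) // 2
    return bytes(a ^ b for a, b in zip(digest[:half], digest[half:]))


class ToyPool:
    """Toy stand-in for an input pool; NOT the kernel's data structure."""

    def __init__(self, seed):
        self.pool = bytearray(seed)

    def extract(self):
        # Hash the whole pool with SHA-1 (20-byte digest).
        digest = hashlib.sha1(self.pool).digest()
        # Feed the digest back into the pool, so the output is also input.
        # (Plain XOR here; the kernel uses its own mixing function.)
        for i, byte in enumerate(digest):
            self.pool[i % len(self.pool)] ^= byte
        # Never expose the raw digest: fold 20 bytes down to 10.
        return fold_in_half(digest)


pool = ToyPool(bytes(64))  # hypothetical all-zero seed, for demonstration only
print(pool.extract().hex())
print(pool.extract().hex())
```

Even in this reduced form, every 10 bytes of output costs a full hash over the pool, which hints at why throughput suffers.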

W.T.F.

Unfortunately, the Linux kernel CSPRNG is not based on any sound theoretical security design. It's very much a hodge-podge home-brew design by developers who think they know what they're doing, when in reality, they don't. In 2013, a security audit and analysis was performed on the Linux kernel CSPRNG (PDF), and concluded that not only is it not robust, but it has some weaknesses:

In the literature, four security notions for a PRNG with input have been proposed: resilience (RES), forward security (FWD), backward security (BWD) and robustness (ROB), with the latter being the strongest notion among them.

(...snip...)

Distributions Used in Attacks based on the Entropy Estimator As shown in Section 5.4, LINUX uses an internal Entropy Estimator on each input that continuously refreshes the internal state of the PRNG. We show that this estimator can be fooled in two ways. First, it is possible to define a distribution of zero entropy that the estimator will estimate of high entropy, secondly, it is possible to define a distribution of arbitrary high entropy that the estimator will estimate of zero entropy. This is due to the estimator conception: as it considers the timings of the events to estimate their entropy, regular events (but with unpredictable data) will be estimated with zero entropy, whereas irregular events (but with predictable data) will be estimated with high entropy.

(...snip...)

As shown in Section 5.7, it is possible to build a distribution D0 of null entropy for which the estimated entropy is high (cf. Lemma 3) and a distribution D1 of high entropy for which the estimated entropy is null (cf. Lemma 4). It is then possible to mount attacks on both /dev/random and /dev/urandom, which show that these two generators are not robust.

(...snip...)

We have proposed a new property for PRNG with input, that captures how it should accumulate the entropy of the input data into the internal state. This property actually expresses the real expected behavior of a PRNG after a state compromise, where it is expected that the PRNG quickly recovers enough entropy. We gave a precise assessment of Linux PRNG /dev/random and /dev/urandom security. In particular, we prove that these PRNGs are not robust. These properties are due to the behavior of the entropy estimator and the mixing function used to refresh its internal state. As pointed by Barak and Halevi [BH05], who advise against using run-time entropy estimation, we have shown vulnerabilities on the entropy estimator due to its use when data is transferred between pools in Linux PRNG. We therefore recommend that the functions of a PRNG do not rely on such an estimator.

Finally, we proposed a construction that meets our new property in the standard model and we showed that it is noticeably more efficient than the Linux PRNGs. We therefore recommend to use this construction whenever a PRNG with input is used for cryptography.

TL;DR? The Linux CSPRNG does not meet the definitions of a secure CSPRNG per the PDF. It's not that it's theoretically broken, it's just not theoretically secure either. It's really nothing theoretically at all. This isn't great.
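The estimator weakness quoted above is easier to see with a toy model. The sketch below credits entropy purely from event timings, using first, second, and third-order deltas. It is loosely modeled on that idea and is not the kernel's actual code; the function name, the log2 credit, and the 11-bit cap are assumptions. Note that the event data is never inspected, which is exactly the problem.

```python
def credited_bits(timestamps, cap=11):
    """Credit entropy for each event from its timing deltas alone (toy model)."""
    credits = []
    last_t = last_d1 = last_d2 = 0
    for t in timestamps:
        d1 = t - last_t          # first-order delta
        d2 = d1 - last_d1        # second-order delta
        d3 = d2 - last_d2        # third-order delta
        last_t, last_d1, last_d2 = t, d1, d2
        smallest = min(abs(d1), abs(d2), abs(d3))
        bits = smallest.bit_length() - 1 if smallest > 0 else 0
        credits.append(min(bits, cap))
    return credits


# Perfectly regular events: even if each one carried unpredictable data,
# the timing-only estimator credits nothing after the warm-up events.
regular = [1000 * i for i in range(1, 51)]

# Fully predictable timings (a fixed cubic curve, zero real entropy) that
# keep all three deltas large, so the estimator credits the maximum.
cubic = [1000 * i ** 3 for i in range(1, 51)]

print(sum(credited_bits(regular)[3:]))
print(sum(credited_bits(cubic)[3:]))
```

The first three events are skipped in the totals because the deltas need a few samples to settle.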

A replacement for random.c in the kernel would be to ditch the homebrew entropy collection, mixing, and output mangling, and instead, stick with AES-128 in CTR mode. Of course, as per the PDF, the entropy collectors need serious work, but if AES-128-CTR was deployed as the CSPRNG instead of SHA-1, then the generator could take advantage of hardware AES performance, which as I've shown, is exceptionally superior. It's frustrating, because the kernel already ships AES, so the code is already there. It's just not being utilized.

The Linux kernel could have 1 GBps in CSPRNG output, but is deliberately choosing not to. That's like having a V12 turbo-charged sleeper, without the turbo, and only firing on 3 of the 12 cylinders, with a duct taped muffler on the back.

Why does 1 GBps of performance matter? How about wiping hard drives or secure data removal in general? With 20 MBps, we can't even saturate a single drive in IOPS. With 1 GBps, we could saturate many simultaneously. As someone who wipes old employee workstations when they leave the company, backup servers with dozens of drives, or old decommissioned hardware, I see great benefit here.
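A quick back-of-the-envelope calculation shows the gap. The 1 TB drive size is an assumption picked for illustration; the 13.4 MB/s rate is the measured /dev/urandom figure from earlier in this post, and 1 GB/s is the hypothetical rate discussed above.

```python
def wipe_seconds(drive_bytes, bytes_per_second):
    """Time to fill a drive once with CSPRNG output at a given rate."""
    return drive_bytes / bytes_per_second


DRIVE_BYTES = 1e12  # assumed 1 TB drive

slow = wipe_seconds(DRIVE_BYTES, 13.4e6)  # measured kernel CSPRNG rate
fast = wipe_seconds(DRIVE_BYTES, 1e9)     # hypothetical 1 GB/s generator

print(f"13.4 MB/s: {slow / 3600:.1f} hours")
print(f"1 GB/s:    {fast / 60:.1f} minutes")
```

At the faster rate the drive itself, not the random number generator, becomes the bottleneck.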

Or, how about HTTPS web sites for a shared web hosting provider? I have seen countless times HTTPS and SSH connections lag due to waiting on the CSPRNG. Not that it's being intentionally blocked, but because the load is so intense on the server, it just can't generate enough cryptographic randomness to keep up with requests.

I'm sure there are plenty of other examples where end userspace applications could benefit with improved performance of the CSPRNG. And, as shown, it can't be that difficult to implement correctly. The real question is, of course, who will do the work and submit the patch?

Cryptographic Hashing, Part I- Introduction

Aaron Toponce — Tue, 08 Mar 2016 01:44:25 +0000

Introduction

Lately, I've been seeing some discussion online about cryptographic hashing functions, along with some confusion between a cryptographic digest, a cryptographic signature, and a message authentication code. At least in that last post, I think I did well defining and clarifying the differences between those terms, but I also feel like I could take this discussion a lot further. So, I decided to dedicate a series to generic cryptographic hashing functions, which will include building compression frameworks with security proofs, specific implementations of cryptographic hashing functions, and some implementations of these functions. So, without further ado, let's get started.

Collisions

When we talk about a hashing function (cryptographic or otherwise), we are referring to any function that can take an arbitrary length of data, and compress it into a fixed-length digest. Typically, this digest is called a "fingerprint", a "checksum", or a "hash". The goal, is that any time we input the same data, our function outputs the same digest. Further, it's important that not only can I produce that digest, but anyone can produce the same digest. This gets us prepared for the Random Oracle, but we still have some ground to cover first.

Because our hashing function has a fixed length output, say 128-bits, then an ideal function would map every input to one of those outputs. In other words, our function maps an element in the domain (our data to be hashed) to exactly one element in the range (our actual hash). So, if our function produces 128-bit digests, then there are a total of 2^128 digests in the range. This means, that we have at least a one-to-one mapping of elements in the domain to elements in the range. Again, speaking about an ideal hashing function.

However, we know that there are many more inputs than just 2^128; there are infinitely many, actually. But think about it for a second. Take the number zero, and send it through our hashing function. Increment that number by 1, then hash that number. Continue in this manner, assuming infinite computing resources and infinite time, until you've hashed every number from 0 to 2^128 - 1. Ideally, you've produced exactly 2^128 unique digests. But, what happens when you now want to hash 2^128? Now we have what is called a collision. In other words, two distinct inputs were hashed to the same output. To put it formally:

Definition: A collision is when two distinct pieces of data hash to the same digest, checksum, or fingerprint.
Theorem: For any fixed-length hashing function, there are infinitely many collisions.
Proof: This can be proven using the pigeon-hole principle. A hashing function with n bits of output has only 2^n possible digests, so hashing 2^n + 1 distinct inputs from the domain must produce at least one collision in the range. Because the domain is infinite, the same argument applies to every further batch of inputs, so there are infinitely many collisions. Q.E.D.
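The pigeon-hole argument can be watched in action on a deliberately tiny hash. The sketch below truncates SHA-256 to 8 bits, which is an assumption made purely to keep the range small (256 digests), so 257 distinct inputs are guaranteed to contain a collision.

```python
import hashlib


def tiny_hash(data):
    """An 8-bit 'hash': the first byte of SHA-256 (toy range of 256 values)."""
    return hashlib.sha256(data).digest()[0]


def first_collision(limit=257):
    """Hash the inputs 0..limit-1 and return the first colliding pair found."""
    seen = {}
    for i in range(limit):
        message = str(i).encode()
        digest = tiny_hash(message)
        if digest in seen:
            return seen[digest], message, digest
        seen[digest] = message
    return None  # unreachable when limit > 256


first, second, digest = first_collision()
print(first, second, digest)
```

In practice the collision shows up long before input 257, which is a preview of the Birthday Paradox discussed below.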

I don't think I need to tell you how much larger infinity is than 2^128. As a result, collisions are overwhelming. In fact, would you like to see a collision in practice? Below are 2 different hexadecimal strings. The differences are very subtle, but they are indeed distinct (emphasized in bold red). Here, we'll take the two strings, and hash them with the known MD5 algorithm. Then, just to show I'm not cheating, we'll hash the same strings with SHA-1. While we produce a collision in MD5, we have distinct digests with SHA-1. Go ahead, and verify that you get the same results.

$ INPUT1=d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f89\
55ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5b\
d8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0\
e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70
$ INPUT2=d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f89\
55ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5b\
d8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0\
e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70
$ printf "$INPUT1" | xxd -r -p | md5sum
79054025255fb1a26e4bc422aef54eb4  -
$ printf "$INPUT2" | xxd -r -p | md5sum
79054025255fb1a26e4bc422aef54eb4  -
$ printf "$INPUT1" | xxd -r -p | sha1sum
a34473cf767c6108a5751a20971f1fdfba97690a  -
$ printf "$INPUT2" | xxd -r -p | sha1sum
4283dd2d70af1ad3c2d5fdc917330bf502035658  -
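If xxd(1) isn't handy, the same check can be done with Python's hashlib. This sketch reuses the two hexadecimal strings from the shell session above; only the variable names are new.

```python
import hashlib

INPUT1 = bytes.fromhex(
    "d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f89"
    "55ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5b"
    "d8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0"
    "e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70"
)
INPUT2 = bytes.fromhex(
    "d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f89"
    "55ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5b"
    "d8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0"
    "e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70"
)

# The inputs differ in a handful of bytes...
differing = [i for i, (a, b) in enumerate(zip(INPUT1, INPUT2)) if a != b]
print("differing byte offsets:", differing)

# ...yet MD5 cannot tell them apart, while SHA-1 can.
print("md5 :", hashlib.md5(INPUT1).hexdigest(), hashlib.md5(INPUT2).hexdigest())
print("sha1:", hashlib.sha1(INPUT1).hexdigest(), hashlib.sha1(INPUT2).hexdigest())
```

Note that some hardened Python builds restrict MD5; on those, hashlib.md5() may need the usedforsecurity=False argument.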

Crazy, right? With an ideal hashing function, it should be at least as difficult as a brute force search to find these collisions, and it should take searching an entire 128-bit domain to find a collision. Unfortunately, however, finding blind collisions with a brute force search turns out to be much faster, thanks to the Birthday Paradox. The Birthday Paradox says the following:

In a room of just 23 people, there is a 50% probability that at least two of them share the same birthday. In a room of just 75 people, there is a 99.9% probability that at least two of them share the same birthday.

Wait, what? Uhm, last I checked, there are 366 days in a year, assuming leap year. Soooo, if there are 23 people in a room, then there should be a 23/366, or about a 6% probability that two people share the same birthday. Unfortunately, this isn't how it works. There may be a 6% chance someone shares your birthday, but there is a 50% chance two arbitrary people share the same birthday. Now do you see the problem? Not only must you compare your birthday to everyone, but so must everyone else. This is a case of combinations. So, with 23 people in the room, there are actually 253 possible comparisons that must be made (23*22/2). The math gets a little hairy, and to be honest, it's a bit outside the scope of this post, and this series (it's going to be long enough as it is). Refer to the Wikipedia article if you want to work through the theory and the proof.
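The numbers are easy to check without the full proof. The sketch below computes the probability that nobody shares a birthday and subtracts it from 1. It assumes 365 equally likely birthdays, which is a simplification; the function name is made up for the example.

```python
from math import comb


def shared_birthday_probability(people, days=365):
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for k in range(people):
        # Person k+1 must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


print(f"pairs among 23 people: {comb(23, 2)}")
print(f"23 people: {shared_birthday_probability(23):.4f}")
print(f"75 people: {shared_birthday_probability(75):.4f}")
```

With 23 people the result is just over one half, and with 75 people it is above 0.999, matching the statement above.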

We can use this Birthday Paradox to work out an attack on finding two distinct inputs that produce an identical digest. This is called the Birthday Attack, and it's the primary driver in finding collisions. The attack basically says something like this:

To find a collision in an n-bit range with approximately 50% probability, you only need to search about the square root of 2^n, or 2^(n/2), elements in the domain.

So, for a 128-bit digest (2^128 possible distinct outputs), using the Birthday Attack, I only need to search 2^64 possible inputs to have approximately a 50% probability that I have found a collision. If you don't think 2^64 is very small, the bitcoin network is currently mining 2^64 SHA-256 digests about every 20 seconds.
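A scaled-down Birthday Attack makes the square-root behavior tangible. The sketch below truncates SHA-256 to 24 bits (an assumption made so the search finishes instantly) and remembers every digest it has seen. With 2^24 possible outputs, a collision is expected after a few thousand inputs, on the order of 2^12, rather than millions.

```python
import hashlib
import itertools


def truncated_hash(data, bits=24):
    """Keep only the top `bits` bits of SHA-256 (toy digest size)."""
    digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return digest >> (256 - bits)


def birthday_attack(bits=24):
    """Hash 1, 2, 3, ... until any two inputs collide; return them and the count."""
    seen = {}
    for tries in itertools.count(1):
        message = str(tries).encode()
        digest = truncated_hash(message, bits)
        if digest in seen:
            return seen[digest], message, tries
        seen[digest] = message


first, second, tries = birthday_attack()
print(f"{first!r} and {second!r} collide after {tries} hashes")
```

The price of the shortcut is memory: every digest seen so far has to be stored to spot the match.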

Blind, preimage, and second preimage collisions

Armed with this knowledge, we can now formalize some definitions of collision attacks. This might be confusing, so I'll define it first, then give some examples.

Collision attack:

A blind search, where two distinct inputs produce the same digest.

Preimage attack:

A search to find an input that matches a defined digest.

Second preimage attack:

A search to find a second file that matches the digest of a defined file.

Let's break these down individually. A collision search is literally a blind search, without any respect to inputs or outputs. You don't know what the inputs will be nor do you know what their outputs will be. You only know that you have found two distinct inputs that collide to the same output, all of which is entirely arbitrary.

A preimage attack is where you have a digest in your possession, but you would like to find an input that matches it. In this case, while the input is completely arbitrary, the output is static. For example, suppose you have the 256-bit hexadecimal digest "ec58d903a9f9dcc9d783da72401b1c94fc8fb9d9623d7141b8b90997382088f9". A preimage attack would be successfully finding the input that produced it. In this case, it was "Cz3eJlm4I2I2rHt8hioZ7evonLyukwlz".

A second preimage attack means having both the input "Cz3eJlm4I2I2rHt8hioZ7evonLyukwlz" and its 256-bit hexadecimal digest "ec58d903a9f9dcc9d783da72401b1c94fc8fb9d9623d7141b8b90997382088f9", and finding a second input that produces that same digest.
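The difference between the last two attacks is easier to see in code. This is a toy sketch only: it truncates SHA-256 to 16 bits (my assumption, so that a brute-force search finishes in a moment), and the function names and the "guess-N" candidate format are made up for illustration.

```python
import hashlib
from itertools import count

BITS = 16  # toy digest size so the search finishes quickly


def toy_digest(data: bytes) -> int:
    """First BITS bits of SHA-256(data), as an integer."""
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full >> (256 - BITS)


def find_preimage(target: int) -> bytes:
    """Preimage attack: only the digest is known."""
    for i in count():
        candidate = b"guess-%d" % i
        if toy_digest(candidate) == target:
            return candidate


def find_second_preimage(known_input: bytes) -> bytes:
    """Second preimage attack: an input and its digest are both known."""
    target = toy_digest(known_input)
    for i in count():
        candidate = b"guess-%d" % i
        if candidate != known_input and toy_digest(candidate) == target:
            return candidate
```

In both cases the output is fixed in advance, so there is no birthday shortcut: the expected work is on the order of 2^16 hashes here, compared to roughly 2^8 for a blind collision on the same 16-bit digest.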

Usually, when breaking cryptographic hash functions, the first thing to break is the compression function, which I'll cover in later posts. Once the compression function is broken, the next step is to break searching for blind collisions. This is generally done by analyzing weaknesses in the mathematics, finding bias in the output, observing the quality of the avalanche effect, and so forth. You eventually learn where the hashing function is weak, and where you can take "shortcuts" to get to your goal. Eventually, the algorithm is broken to the point that finding blind collisions is practical. MD5 is broken in this regard.

After breaking the compression function, and weakening the algorithm to the point of practical collision attacks, preimage attacks become the next focus of analysis. However, when the compression function is broken, such as in the case of SHA-1, it's a strong sign to start moving away from the algorithm, long before you find collisions. So, analysis tends to slow down after collisions have been found, because no one should be using the function anymore. This also means continuing to find second preimage collisions gets even less attention.

Avalanche Effect

The final property of cryptographic hashing functions that needs to be addressed is the "avalanche effect". It is absolutely critical in cryptographic hashing functions that even though inputs may be sequential, their outputs do not show that to be the case. For example, consider the SHA-256 of the integers 1 through 10:

$ for I in {1..10}; do printf "$I: "; printf "$I" | sha256sum -; done
1: 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b  -
2: d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35  -
3: 4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce  -
4: 4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a  -
5: ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d  -
6: e7f6c011776e8db7cd330b54174fd76f7d0216b612387a5ffcfb81e6f0919683  -
7: 7902699be42c8a8e46fbbb4501726517e86b22c56a189f7625a6da49081b2451  -
8: 2c624232cdd221771294dfbb310aca000a0df6ac8b66b696d90ef06fdefb64a3  -
9: 19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7  -
10: 4a44dc15364204a80fe80e9039455cc1608281820fe2b24f1e5233ade6af1dd5  -

Notice that there is no clear indication that the digests came from sequential inputs. For all practical purposes, the output looks truly random, despite the sequential input (each input differs from the previous one by only a few bits). However, can we formally define the avalanche effect? What would be ideal is that with each bit change on the input, every bit in the digest output has as close to a 50% chance of being flipped as theoretically possible.

I'll talk more about "rounds" in future posts when I talk about specific implementations and designs. Suffice it to say that a cryptographic hashing function will iterate through the compression function a certain number of times, before outputting the state. On each round, the bits in the output should each have a 50% chance of being flipped. So, on each output of each round iteration, close to half of the bits have been flipped in some pseudorandom manner. After a certain number of rounds, the final output should be indistinguishable from true random noise.

So, how about this as a formal definition:

When a single input bit is flipped, each output bit should change with a 50% probability.
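That definition can be checked empirically. The following is a rough Python sketch, with names of my own choosing: it flips each bit of an input in turn, hashes the result with SHA-256, and reports the average fraction of the 256 output bits that changed. It only measures the average, so treat it as a sanity check, not as a proof of anything.

```python
import hashlib


def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of `data` with bit number `index` inverted."""
    out = bytearray(data)
    out[index // 8] ^= 1 << (index % 8)
    return bytes(out)


def avalanche_fraction(data: bytes) -> float:
    """Average fraction of SHA-256 output bits changed per flipped input bit."""
    base = int.from_bytes(hashlib.sha256(data).digest(), "big")
    input_bits = len(data) * 8
    total_changed = 0
    for i in range(input_bits):
        flipped = int.from_bytes(hashlib.sha256(flip_bit(data, i)).digest(), "big")
        # XOR leaves a 1 wherever the two digests disagree.
        total_changed += bin(base ^ flipped).count("1")
    return total_changed / (input_bits * 256)


if __name__ == "__main__":
    print(avalanche_fraction(b"The quick brown fox jumps over the lazy dog"))
```

A result close to 0.5 is what the definition asks for.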

Of course, the cryptographic strength doesn't rest solely on the avalanche effect. There are mathematical properties that determine that. But, the output should be completely unpredictable. You could apply the "next bit test", in that there is no algorithm you could produce that would determine the next state of the next bit, without actually compromising the state of the machine (this is a test held to cryptographically secure pseudorandom number generators).

Unfortunately, all we have to test the avalanche effect is standard randomness tests, such as the chi-square distribution, Monte Carlo for Pi, and the Birthday Paradox, among others. This doesn't say anything about the cryptographic strength of the hashing function, but says a lot about randomness properties (non-cryptographic hashing functions can also exhibit strong randomness qualities).
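As a small illustration of that kind of test, here is a sketch of a chi-square check on the byte values coming out of SHA-256. The inputs (the decimal strings 0, 1, 2, ...) and the sample count are arbitrary choices of mine. With 256 possible byte values there are 255 degrees of freedom, so a statistic somewhere near 255 is consistent with uniformly distributed bytes.

```python
import hashlib
from collections import Counter


def chi_square_of_digest_bytes(samples: int = 4096) -> float:
    """Chi-square statistic of byte frequencies over `samples` SHA-256 digests."""
    counts = Counter()
    for i in range(samples):
        # Iterating over a bytes object yields integers 0..255.
        counts.update(hashlib.sha256(str(i).encode()).digest())
    total = samples * 32          # 32 bytes per SHA-256 digest
    expected = total / 256        # expected count per byte value if uniform
    return sum((counts[b] - expected) ** 2 / expected for b in range(256))


if __name__ == "__main__":
    print(chi_square_of_digest_bytes())
```

As noted above, passing this says nothing about cryptographic strength; plenty of non-cryptographic functions would pass it too.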

There are a couple of software utilities we can use to test and analyze cryptographic hashing functions. First, we have standard randomness tests, such as Dieharder and the FIPS 140-2 suite. But, for something more specific on analyzing cryptographic primitives, I would recommend Cryptol. On the one hand, this isn't an out-of-the-box software solution for just running a battery of tests and analysis. It is actually a domain-specific language that will require a bit of a learning curve. On the other hand, it's Free Software, and you'll probably learn more about cryptanalysis with this tool, than just playing with randomness tests.

Conclusion

This was just a primer post to get you thinking about cryptographic hashes, specifically thinking about their output, and the task of finding collisions. The rest of the posts in the series will cover specific functions such as MD5, SHA-1, -2, and -3, as well as some others. We'll talk about hashing constructions, and where you'll find cryptographic functions in practice (I think you'll be surprised). I may even throw in a post or two about random oracles, and how we want cryptographic hashing functions to not only imitate them, but be proven secure under the "Random Oracle Model".

Regardless, this post will get you started, and hopefully excited for what is to come.

Manual Authenticated File Encryption With OpenSSL

Aaron Toponce — Sat, 27 Feb 2016 17:37:52 +0000

One thing that bothers me about OpenSSL is the lack of commandline support for AEAD ciphers, specifically AES in CCM and GCM block modes. Why does this matter? Suppose you want to save an encrypted file to disk, without GnuPG, because you don't want to get into key management. Further, suppose you want to send this data to a recipient or store it on a server outside of your full control. The authenticated encryption is important, otherwise the ciphertext is malleable and vulnerable to bit flipping.

So, when you get to the shell, you may try using AES in GCM mode with OpenSSL's "enc(1)" command, only to be left wanting. Here, we generate a key from /dev/urandom, convert it to hexadecimal, and provide the key as an argument on the command line.

$ LC_CTYPE=C tr -cd 'abcdefghjkmnpqrstuvwxyz23456789-' < /dev/urandom | head -c 20; echo
sec2tk24ppprcze33ucs
$ echo sec2tk24ppprcze33ucs | xxd -p
73656332746b323470707072637a6533337563730a
$ openssl enc -aes-256-gcm -k 73656332746b323470707072637a6533337563730a -out file.txt.aes -in file.txt
AEAD ciphers not supported by the enc utility
$ echo $?
1

So, rather than using GCM, we can build the authentication tag manually with HMAC-SHA-512, which OpenSSL does support. This means using a non-authenticated block cipher mode, such as CTR, as a first step, then authenticating the ciphertext manually as a second step.

Using our same password from the previous example, we'll do this in two steps now:

$ openssl enc -aes-256-ctr -k 73656332746b323470707072637a6533337563730a -out file.txt.aes -in file.txt
$ openssl dgst -sha512 -binary -mac HMAC -macopt hexkey:73656332746b323470707072637a6533337563730a -out file.txt.aes.mac file.txt.aes

Now you have three files: your plaintext file, your AES encrypted ciphertext file, and your HMAC-SHA-512 authentication file:

$ ls -l file.txt*
-rw-rw-r--. 1 aaron aaron 1050 Feb 27 10:26 file.txt
-rw-rw-r--. 1 aaron aaron 1066 Feb 27 10:27 file.txt.aes
-rw-rw-r--. 1 aaron aaron   64 Feb 27 10:28 file.txt.aes.mac

When sending or remotely storing the "file.txt.aes" file, you'll want to make sure the "file.txt.aes.mac" authentication file accompanies it. Unfortunately, the OpenSSL dgst(1) command does not support verifying message authentication codes, so you'll have to script this manually: generate a second file from the received ciphertext, maybe "file.txt.tmp.mac", then compare the two. If they match, you can decrypt the "file.txt.aes" ciphertext file. If not, discard the data.
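If you would rather not script the comparison in shell, the verification step could look something like this Python sketch. It assumes the ".mac" file holds the raw 64-byte tag (as written by "-binary" above) and that the HMAC key is the byte string behind the "hexkey" value; the function names are my own.

```python
import hashlib
import hmac


def hmac_sha512_file(path: str, key: bytes) -> bytes:
    """Compute the raw HMAC-SHA-512 tag of a file, reading it in chunks."""
    mac = hmac.new(key, digestmod=hashlib.sha512)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            mac.update(chunk)
    return mac.digest()


def verify_file(path: str, mac_path: str, key: bytes) -> bool:
    """Recompute the tag over `path` and compare it to the stored one."""
    with open(mac_path, "rb") as handle:
        expected = handle.read()
    # compare_digest avoids leaking where the first mismatch occurs.
    return hmac.compare_digest(hmac_sha512_file(path, key), expected)


if __name__ == "__main__":
    key = bytes.fromhex("73656332746b323470707072637a6533337563730a")
    print(verify_file("file.txt.aes", "file.txt.aes.mac", key))
```

Only decrypt "file.txt.aes" when this returns True.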

This isn't elegant, and I wish enc(1) supported AEAD, but as it stands, it doesn't. So, you'll have to stick with doing things manually. However, this is something simple enough to script, and provides both data confidentiality and authenticity, which should be the goal of every ciphertext.

Digest Algorithms in Google Spreadsheets

Aaron Toponce — Fri, 26 Feb 2016 23:51:46 +0000

I can't imagine there are a lot of uses for using digest algorithms in spreadsheets, but I came up with one, and I really wished I had access to them. Seeing as though most spreadsheet applications don't ship one, I figured I would create my own.

Mostly, I use Google for my document processing and spreadsheet use, and I had a spreadsheet of Louis L'Amour books. My grandfather gave me his entire selection of Louis L'Amour books last year, and I made a goal to read them all during 2016. I grew up listening to them on audiotape in the truck when I was on the road with him laying carpet, heading up to the family cabin in Idaho, and other things, so, I have memories of many of the stories. It will be fun to read them.

So, what do digest algorithms have to do with Louis L'Amour novels? Well, after the spreadsheet was created to track what I've read and what I have left, as well as a pace (I have to be reading at least 60 pages every day), I wanted to start reading the books in a random order. Sure, I'll read the Sackett, Hopalong Cassidy, Talon & Chantry, and Kilkenny series first, but when I'm finished with the series, I want to read the novels in random order. Why? Because I don't want to get caught up in published year watching him change as a writer, or go in alphabetical order, because that's boring. Randomness is exciting!

Now I could have used the =RAND() function in the spreadsheet, but when I sort the columns, the numbers change. So, I need to copy and paste their values, then sort the columns. Besides, is RAND() even cryptographically secure (indistinguishable from true random noise)? Even better, I could just get ASCII data off of /dev/urandom and paste those results into the column, then sort off of that. But that requires using an external tool. However, I could also use a digest algorithm to calculate the digest of the book title, then sort by the digest. Because digest algorithms aren't part of the Google Spreadsheet default functions, my OCD kicked in, and I had to create one.

Here is what I came up with. You'll notice that MD2(), MD5(), and SHA1() are created, even if they're not cryptographically secure for today's modern cryptographic applications. However, in this specific use case, such as sorting columns, they are fine. Also, notice that SHA256(), SHA384(), and SHA512() exist, but not SHA224(). This is because "Utilities.DigestAlgorithm" does not export a "SHA_224" algorithm, which in my opinion, is just odd. MD4 is also not available, nor any of the SHA-3 functions. Regardless, all the digest algorithms supported by the API are available.

// Cryptographic hash functions for use in Google Spreadsheets
// Use with =MD5(string)
// A string or cell (or concatenation of cells) can be provided
// Output in hex
// Released to the public domain

// Utilities.computeDigest() returns signed bytes (-128..127), so each
// value is shifted into 0..255 before being written as two hex characters.

function MD2(s) {
  var hexstr = '';
  var digest = Utilities.computeDigest(Utilities.DigestAlgorithm.MD2, s);
  for (var i = 0; i < digest.length; i++) {
    var val = (digest[i]+256) % 256;
    hexstr += ('0'+val.toString(16)).slice(-2);
  }
  return hexstr;
}
function MD5(s) {
  var hexstr = '';
  var digest = Utilities.computeDigest(Utilities.DigestAlgorithm.MD5, s);
  for (var i = 0; i < digest.length; i++) {
    var val = (digest[i]+256) % 256;
    hexstr += ('0'+val.toString(16)).slice(-2);
  }
  return hexstr;
}
function SHA1(s) {
  var hexstr = '';
  var digest = Utilities.computeDigest(Utilities.DigestAlgorithm.SHA_1, s);
  for (var i = 0; i < digest.length; i++) {
    var val = (digest[i]+256) % 256;
    hexstr += ('0'+val.toString(16)).slice(-2);
  }
  return hexstr;
}
function SHA256(s) {
  var hexstr = '';
  var digest = Utilities.computeDigest(Utilities.DigestAlgorithm.SHA_256, s);
  for (var i = 0; i < digest.length; i++) {
    var val = (digest[i]+256) % 256;
    hexstr += ('0'+val.toString(16)).slice(-2);
  }
  return hexstr;
}
function SHA384(s) {
  var hexstr = '';
  var digest = Utilities.computeDigest(Utilities.DigestAlgorithm.SHA_384, s);
  for (var i = 0; i < digest.length; i++) {
    var val = (digest[i]+256) % 256;
    hexstr += ('0'+val.toString(16)).slice(-2);
  }
  return hexstr;
}
function SHA512(s) {
  var hexstr = '';
  var digest = Utilities.computeDigest(Utilities.DigestAlgorithm.SHA_512, s);
  for (var i = 0; i < digest.length; i++) {
    var val = (digest[i]+256) % 256;
    hexstr += ('0'+val.toString(16)).slice(-2);
  }
  return hexstr;
}
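The only slightly odd part of the script is the "(digest[i]+256) % 256" step. The digest comes back as signed bytes, so values above 127 show up as negative numbers and need to be wrapped into the 0 to 255 range before hex encoding. Here is the same conversion in isolation, sketched in Python with a made-up function name:

```python
def signed_bytes_to_hex(signed_bytes):
    """Convert signed byte values (-128..127) to a lowercase hex string."""
    # (b + 256) % 256 maps -1 to 255, -128 to 128, and leaves 0..127 alone.
    return "".join(format((b + 256) % 256, "02x") for b in signed_bytes)


if __name__ == "__main__":
    print(signed_bytes_to_hex([-1, 0, 127, -128]))  # prints ff007f80
```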

So, how do we apply it? Because this is a script which is applied to your spreadsheet, you need to add it manually every time you create a new spreadsheet document. Supposedly, you can permanently add it through the Chrome Web Store (I created a project that is currently pending review), but for the time being, copying/pasting works.

Navigate to "Tools > Script Editor", remove the default function Google provides, and add the code above. Save it as a project, then it will be available to your spreadsheet. Now you can use it.

For example,

To calculate the MD5 of cell "A1": =MD5(A1)
To calculate the SHA1 of cells "A1" through "D1": =SHA1(CONCATENATE(A1:D1))
Retrieve the 12 left-most characters from a SHA512 digest of cells "A1" through "D1": =LEFT(SHA512(CONCATENATE(A1:D1)), 12)
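The sort-by-digest trick itself isn't tied to spreadsheets. As a rough sketch of the same idea in Python (the function name and the short title list are just for illustration), you hash each title and sort on the hex digest:

```python
import hashlib


def digest_order(titles):
    """Return titles sorted by the SHA-256 hex digest of each title."""
    return sorted(
        titles,
        key=lambda title: hashlib.sha256(title.encode("utf-8")).hexdigest(),
    )


if __name__ == "__main__":
    for title in digest_order(["Hondo", "Flint", "Sackett", "The Daybreakers"]):
        print(title)
```

Unlike =RAND(), the order is stable: the same titles always sort the same way, and adding a new book doesn't reshuffle the rest.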

I'm sure there are a few uses for digest algorithms in spreadsheets, they're just not very common; at least web searching for their use gives me very scant results. If you find these helpful, even if there are other solutions, I would be interested in how you used them in the comments below.

Oh, and here's my Louis L'Amour spreadsheet tracking my reading progress.

My Strange Tweets

Aaron Toponce — Thu, 18 Feb 2016 03:01:46 +0000

You may have noticed some tweets from me that look.... strange. Probably something like these:

UNYEf FXgOZ ILokj nIbFM qIlTr BRwQX iQciZ OtVhi GbUzj IGMVC SrOix sXHRZ TCtfJ #talon #cardciphers

— Aaron Toponce (@AaronToponce) February 17, 2016

1455741420: 198027fd95bb881b223161d0df1b325fea7dab7f #ripemd160 #unix #epoch

— Aaron Toponce (@AaronToponce) February 17, 2016

First, let me provide some background. When Twitter was announced, a couple Free Software developers got together to create a self-hosted Free Software alternative. They called that alternative "Identica", because it was hosted in Canada and was a way to establish your social identity. It made sense, and the Free Software and Open Source ecosystem ate it up. Within no time, it was a thriving online social network, involving mostly those from the Free Software and Open Source world, with all sorts of very influential developers and people creating accounts.

One account that seemed to catch the eye of many was @key. It posted what appeared to be MD5 checksums every 2 hours, regularly and consistently. Plenty of people were following the account, yet it wasn't following anyone. People replied to the tweets, asking what it was posting, who it was, why it was doing what it was doing, if it was a government account, etc. No one could figure it out, and if they were MD5 checksums, no one could reproduce them. It was a social enigma, and it kept people enthralled and engaged.

I thought this was exceptionally creative, and I was quite jealous that I didn't think of it first. The best I figured was that it was posting the timestamp of the tweet with a custom salt. At least, that is what I would have done. It couldn't be an MD5 of random data, otherwise, why not just post the random data? Or is that exactly what it is? So, instead, I decided to play with the Identica API and roll my own, using my own account. I had already set up the "Identica-Twitter bridge", so anything I posted to my Identica account would get posted to Twitter automatically.

But, I have to be different. So rather than a random digest that no one could figure out (I'm sure it's a timestamp), I wanted something a little more transparent. I started with taking the SHA-1 of the Unix epoch (the number of seconds since Jan 1, 1970 00:00:00 UTC) at 13:37 local time, because it's leet. This was easily accomplished with a bit of shell code:

$ EPOCH=$(date --date="today 13:37" +%s); printf "$EPOCH: "; printf "$EPOCH" | sha1sum - | cut -d ' ' -f 1

This was the first tweet:

1293223021: a7f8a4265df407c11a3b9471b5a163f4c0fe4873 #unix #epoch #sha1

— Aaron Toponce (@AaronToponce) December 24, 2010
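Anyone who wants to check one of these tweets can recompute the digest from the timestamp. Here is a small Python equivalent of the shell one-liner, as a sketch with function names of my own. Like the printf in the shell version, it hashes only the decimal digits of the timestamp, with no trailing newline.

```python
import hashlib
import time


def epoch_digest(epoch: int, algorithm: str = "sha1") -> str:
    """Hex digest of the decimal epoch string, without a trailing newline."""
    return hashlib.new(algorithm, str(epoch).encode("ascii")).hexdigest()


def tweet_text(epoch: int) -> str:
    """Format the timestamp and digest the same way the tweets do."""
    return f"{epoch}: {epoch_digest(epoch)}"


if __name__ == "__main__":
    print(tweet_text(int(time.time())))
```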

Later however, I wanted something even more creative. I go by the online nick "eightyeight" on IRC, because I play the piano. However, some Asian cultures see the number "8" as lucky. With "Chinese" fortune cookies, I figured I would "encrypt" a fortune at 08:08 local time. Again, I decided to do this with a bit of shell code:

$ fortune -s -n 70 | gzip -c | base64 | rot13 | paste -sd ''

The first tweet to hit that was (testing the API, so this one actually wasn't at 08:08):

U4fVNB8Uux0NN3CBlRmBGf1G8ZxfXpyWIFuCYSLblxmCXAUwNtNtH04TTtNNNN== #rot13 #base64 #gzip #crypto

— Aaron Toponce (@AaronToponce) March 20, 2011
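Of course, none of those layers is encryption, which is why "encrypt" is in quotes: every step is reversible without a key. A rough Python sketch of the same pipeline and its inverse, using only the standard library (the function names are mine):

```python
import base64
import codecs
import gzip


def obscure(text: str) -> str:
    """gzip, then base64, then rot13, mirroring the shell pipeline."""
    compressed = gzip.compress(text.encode("utf-8"))
    encoded = base64.b64encode(compressed).decode("ascii")
    return codecs.encode(encoded, "rot13")


def reveal(blob: str) -> str:
    """Undo the layers in reverse order: rot13, base64, gunzip."""
    encoded = codecs.decode(blob, "rot13")
    return gzip.decompress(base64.b64decode(encoded)).decode("utf-8")


if __name__ == "__main__":
    blob = obscure("You will be surprised by a loud noise.")
    print(blob)
    print(reveal(blob))
```

The exact output of obscure() can differ from the shell version, since gzip stores a timestamp in its header, but reveal() undoes either one.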

However, Identica started going downhill. First, we had big challenges fighting bot spam. Despite repeated bug reports and discussion on the network, very little change was happening in the code to combat the spam (for future reference, just use Hashcash tokens as a proof-of-work for form submissions). Then, after getting venture capital and attempting to appeal to the mass market, things started changing. First it rebranded itself as "Status.Net", then we lost threaded replies. The API was no longer Twitter compatible (at least some things were different), and branding got real weird. Then it rebranded itself again under a completely new code rewrite as "pump.io", and that is the status today. At this last rebranding, the API was no longer functional, and my scripts stopped. I didn't want to work with the Twitter API, so I didn't bother setting it up again.

It wasn't until some time ago I decided to resurrect my cryptic tweets. However, I made some changes. Instead of using SHA-1, I decided to use RIPEMD-160. Although it hasn't had the mountains of analysis SHA-1 has had, RIPEMD-160 is still considered secure, although with its 160-bit digest size, the security margin might be a bit too slim for some. However, I stuck with the same Unix epoch timestamp automated at 13:37 local time.

Then, after developing my own playing card cipher, and refining it with the help of @timshadel, I decided to actually attempt a legitimate (if still insecure) cipher with Talon. It's still a fortune (BOFH style) and it's still published at 08:08 local time for the same reasons. If you want a crack at decrypting it, check out my playing card cipher repository at https://github.com/atoponce/cardciphers. There should be a new one every day, but it may be possible that the fortune is 1 character too long, and as a result, it doesn't get posted (I've accounted for this, but I'm sure I've missed something).

What's the point? Nothing more than just a bit of fun. It's probably not something you're interested in seeing on your timeline, and I don't blame you. Granted, there will be one of each every day. If you don't have a busy timeline, I guess it could get a bit old. But, I don't plan on stopping, nor using a separate account.

Checksums, Digital Signatures, and Message Authentication Codes, OH MY!

Aaron Toponce — Wed, 17 Feb 2016 05:01:16 +0000

I recently submitted a bug to the Vim project about its Blowfish encryption not using authentication. Bram Moolenaar, the lead developer of Vim, responded about using checksums and digital signatures. I hope he doesn't mind me using him as an example here, but I want to quote the relevant bits (emphasis mine):

The encryption is meant to avoid other people, who don't have the key, from reading the text. It does not have the goal of protecting manipulation of the text, that is something else. You could add a checksum even when not using encryption. I believe it's called signing.

Unfortunately, Bram is confusing checksums, digital signatures, and message authentication codes, all rolled up into one. I don't blame him. This is a topic that is not well understood by those not intimately familiar with cryptography. In a nutshell, each provides data integrity at the core. Where they differ is whether or not you're using encryption keys, and whether or not those encryption keys are symmetric or asymmetric. So, in this post, I would like to break it down.

Checksums

Checksums do not require any sort of encryption key. They are simply digests, or "fingerprints" that represent some data. When you download a piece of software from the Internet, there may be a file with an MD5, SHA-1, or SHA-256 hash of the file. This is the software vendor providing a way for you to verify that you got all the correct bits when the download completes.

For example, suppose you wish to download the latest Debian 8.3.0 amd64 ISO from https://mirrors.xmission.com/debian-cd/8.3.0/amd64/iso-cd/. Notice that there are the following files: MD5SUMS, SHA1SUMS, SHA256SUMS, & SHA512SUMS. Part of the SHA256SUMS file looks like this:

$ head SHA256SUMS
1dae8556e57bb04bf380b2dbf64f3e6c61f9c28cbb6518aabae95a003c89739a  debian-8.3.0-amd64-CD-1.iso
89facfbb5039e49d4e3eeff1cca6ab55e9121ff46affeb46ed510c11731acf41  debian-8.3.0-amd64-CD-10.iso
7f6bc807d3636975374b937c2724353f7468ecd7a61e60f2a8b71f92eeefe629  debian-8.3.0-amd64-CD-11.iso
bd99b7c274ea400b50960ab9e46dd23bad76f87574d2ceee1e8e43859fbd045b  debian-8.3.0-amd64-CD-12.iso
e85679304a509593526cffa77ff0d675329565eb4430444ee2c0d2cdd87842a8  debian-8.3.0-amd64-CD-13.iso
69f727bceb0460957bbd5023fe79749c6bf9f0e3a1b89945e6c63c6b3f04f509  debian-8.3.0-amd64-CD-14.iso
d1dab389f8cb794013986d2da8a6dc72c0be8bc932fcc6d7291cb09b418724d5  debian-8.3.0-amd64-CD-15.iso
913b5d89322b500a02f699d44778901cb59aae909f09bff64963115143c2a6ca  debian-8.3.0-amd64-CD-16.iso
0638aca6f59a8f5bec6d1cd4d272cea01758c2b2d6ec1412048ecb78ef684a77  debian-8.3.0-amd64-CD-17.iso
6f17742fbc82828f04da39f66647e958b0ac667cb4d2a40c9888c749680f1eb8  debian-8.3.0-amd64-CD-18.iso

So, when downloading "debian-8.3.0-amd64-CD-1.iso", I can use the sha256sum(1) command to verify the file:

$ sha256sum debian-8.3.0-amd64-CD-1.iso
1dae8556e57bb04bf380b2dbf64f3e6c61f9c28cbb6518aabae95a003c89739a  debian-8.3.0-amd64-CD-1.iso

The digest matches, so the download was successful and all the correct bits exist. Another way would be to download the SHA256SUMS file, and use the "-c" switch for the utility to verify the checksum automatically, rather than you eyeballing it:

$ sha256sum -c SHA256SUMS
debian-8.3.0-amd64-CD-1.iso: OK

The important thing to understand about checksums, is they are completely and totally anonymous. There is no secret shared between the server where I downloaded the software and myself, and there is no identity attached to the checksum. This means that anyone can change the original file and recalculate the checksum. So, if transferring data over the Internet, there is nothing preventing a man-in-the-middle attack from replacing the bits you're downloading with something else, while also replacing the checksum.

In other words, checksums provide data integrity, but they do not offer any sort of authentication. However, there are a number of checksum hashing functions, both cryptographically secure and not, such as CRC, MurmurHash, MD5, SHA-1, SHA-2, SHA-3, and so forth. For non-authenticated data integrity, a cryptographically secure hash function isn't always desirable, which is why non-cryptographic hash functions exist.
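To make the "no key, no identity" point concrete, here is roughly what a checksum verification boils down to, sketched in Python. The function names are mine, and the parser only handles the simple "digest, whitespace, file name" layout shown in the SHA256SUMS excerpt above.

```python
import hashlib


def sha256_of_file(path: str) -> str:
    """Hex SHA-256 of a file, read in chunks so large ISOs fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sums(text: str) -> dict:
    """Map file name -> expected digest from sha256sum-style lines."""
    expected = {}
    for line in text.splitlines():
        if line.strip():
            digest, name = line.split(maxsplit=1)
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def check(path: str, name: str, sums: dict) -> bool:
    """True when the file's digest matches the one listed for `name`."""
    return sums.get(name) == sha256_of_file(path)
```

Nothing in there involves a secret. Anyone who can replace the file can also replace the line in SHA256SUMS, and check() would still happily return True.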

Digital Signatures

Digital signatures are a form of checksum, in that they provide data integrity, but they require asymmetric encryption to also provide authenticity. Digital signatures are a way to attach an identity to the checksum. This implies a level of trust between you and the 3rd party, such as Debian with our example above. If you have met with the 3rd party, or dealt with them enough to establish some level of trust, then you can install the 3rd party's public key into your system. Then, when they provide data attached with a digital signature, you can verify that the data did in fact come from the 3rd party, and no other source.

Notice that a man-in-the-middle attack is no longer valid here, if I already have the 3rd party's public key installed on my system. Going back to our example with Debian, I have the Debian signing public key already installed. So, I can now download the MD5SUMS.sign, SHA1SUMS.sign, SHA256SUMS.sign, or SHA512SUMS.sign file, along with the checksums file I already downloaded, and verify that the checksums are those intended by Debian:

$ gpg --verify SHA256SUMS.sign
gpg: assuming signed data in `SHA256SUMS'
gpg: Signature made Sun 24 Jan 2016 11:08:33 AM MST using RSA key ID 6294BE9B
gpg: Good signature from "Debian CD signing key "
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.
Primary key fingerprint: DF9B 9C49 EAA9 2984 3258  9D76 DA87 E80D 6294 BE9B

If we look at the contents of the SHA256SUMS.sign file, we get the following:

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQIcBAABCAAGBQJWpRMhAAoJENqH6A1ilL6bYz4P/3ZNCR8N+rrlSSgTN/AkpSVt
WXWg2BTflY3cPYmKK/osJUvLT7HTPDhabPiuQY2jJxrHYJhq5sCOrhbgc4eSRmIf
IsSm7OxQ9TXqde4mg9DVsxmIRui/rVbhEjkAVu47A0eGDUrRxczgJUo14En3jO0Z
qhXypCIN90y8HWaqy6OMe+eCsPyGxmXpWRT1XEH9tOX21wCAaxUl6ZHkiqNqdt8u
Erojls77nlaBR/tvB9CHXTkUmqsocdYD+n5UsvtmLlYN0nz85b7NhrLEW2QtugLd
MJngeI5eJvI4Hjyas0HfSlsdoBAvF+Uw3Dn9aHiTIeWVIeCYUKhdXmLww0dL0n95
jVuBSuMavQwOKKRGTbvG++RET9s/2U/G95wK0Vfx5fsf1neKVJgYf9q9iyObgcH8
dRLAqkgWJBNkvm9oXmpcy7jAq8jlXzDfaPz8plAyqDuIXoOSCHpJ5KAbAS1cYLIT
9U2cQLKTbCPrWJT5xZzOMuCPWu1CzfluDEafFsNzurWG5vCmFEJ+vV9strkeEIuX
tFeKVDkkhVEZYQKSbIlidXBa/WP2Q0g1KvlKXb+nsnWDtWAjLUPD621F3ZjUcjlX
aDPv3J+7kqfryA/7qYMVTH67KY3DwKIDKt6XtquxSf7HuYqEwXKIXp2De7zCCEqH
csWVPFNUQyOdetIC/l/w
=TjXr
-----END PGP SIGNATURE-----

The details of a PGP signature aren't important for this blog post. Suffice it to say that it requires the signer's private key to produce the signature, and the signer's public key to verify it. The sender will hash the message, sign that hash with their private key, and append the resulting signature to the message. When the recipient receives the message, they will separate the signature from the message, hash the message themselves, and use the sender's public key to verify that the signature matches that hash. If it does, the recipient knows that the holder of the sender's private key signed the data, and no one else.

Because the private key is required to create the signature, and because only the 3rd party should have access to the private key, this means that a man-in-the-middle attack is no longer effective. A miscreant should not be able to remove the signature, apply a new signature, and have the recipient still verify the signature as "good" from Debian, unless that miscreant also had access to Debian's private signing key.
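The hash-then-sign flow can be sketched with textbook RSA in a few lines of Python. To be loud about it: this is a toy and NOT how PGP or any real signature scheme works in detail. There is no padding, the 150-bit modulus built from two well-known Mersenne primes is trivially factorable, and all of the names are my own. It only shows which key is used where.

```python
import hashlib

# Toy parameters, NOT secure: two publicly known Mersenne primes.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                           # public modulus
E = 65537                           # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (needs Python 3.8+)


def message_hash(message: bytes) -> int:
    """SHA-256 of the message, reduced so it fits under the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Signer: hash the message, then apply the PRIVATE exponent."""
    return pow(message_hash(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Verifier: apply the PUBLIC exponent and compare against a fresh hash."""
    return pow(signature, E, N) == message_hash(message)
```

Only the holder of D can produce a value that verify() accepts, while anyone holding just N and E can check it. That asymmetry is what lets an identity hang off the signature.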

To make an example of an existing software vendor, Arch Linux was under heat about this. A core developer strongly disagreed about digitally signed packages. They would provide their software from their repositories with MD5 checksums only. The packages were not digitally signed.

So, when your local Arch Linux installation would request packages from the Arch Linux software repository, unless served over HTTPS, a man-in-the-middle could interject their own bits with their own MD5 checksum. Your pacman(8) package manager would verify that the MD5 is valid, and proceed to install the software with root privileges, because that is what you told it to do. By also digitally signing the package with an Arch Linux signing key, this attack is no longer possible.

Eventually, Arch Linux fixed the vulnerability, and closed a very large security hole, by digitally signing their packages.

As with checksums, digital signatures really should be using a cryptographically secure hashing function as part of the protocol. This can include RIPEMD160, SHA-2, SHA-3, BLAKE2, Skein, and others. MD5 and SHA-1 are no longer considered cryptographically secure, and should not be used with digital signatures (which is why SSL certificates moved to SHA-256 instead of SHA-1).

Message Authentication Codes

Finally, message authentication codes (also called "MAC tags") are another way to provide data integrity with authentication, but this time using symmetric encryption. Where digital signatures imply a physical identity behind the authentication, MAC tags provide anonymous authentication. Generally, symmetric keys don't have identities associated with them. They're usually short-lived and shared via complex key exchanges, such as the Diffie-Hellman key exchange.

A MAC tag is keyed, meaning a shared secret is used when calculating the digest. There are a number of different implementations of MACs, such as CBC-MAC, HMAC, UMAC, & Poly1305, among others. The differences between them aren't important for this post. What is important, is how they are calculated, and how they are used with encryption.

The sender of some message will apply a keyed cryptographic hashing function to the message, using the shared secret key, append (or prepend) the resulting digest to the message, and send the full payload off. Because the MAC was calculated with a shared secret key, a man-in-the-middle cannot strip the MAC tag and apply their own without knowing the shared secret. When the recipient receives the payload, they will strip off the MAC tag, rehash the message with the same keyed hashing function, and see if the two MAC tags match. If they do match, the message can be acted upon. If they do not match, something happened to the data in transit, and the payload can be safely ignored.
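The append, strip, and compare steps look something like this in Python. This is a minimal sketch using HMAC-SHA-256 from the standard library; the function names and the choice to append the tag at the end of the payload are mine.

```python
import hashlib
import hmac

TAG_LENGTH = hashlib.sha256().digest_size  # 32 bytes for HMAC-SHA-256


def seal(key: bytes, message: bytes) -> bytes:
    """Sender: compute the keyed digest and append it to the message."""
    tag = hmac.new(key, message, hashlib.sha256).digest()
    return message + tag


def open_payload(key: bytes, payload: bytes) -> bytes:
    """Recipient: strip the tag, recompute it, and compare the two."""
    message, tag = payload[:-TAG_LENGTH], payload[-TAG_LENGTH:]
    expected = hmac.new(key, message, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("MAC tag mismatch; discarding payload")
    return message
```

Without the key, a man-in-the-middle can change the message but cannot produce a tag that open_payload() will accept.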

There are three main ways to apply MAC tags to messages: encrypt-then-MAC, MAC-then-encrypt, and encrypt-and-MAC. The first, encrypt-then-MAC, is considered "best practice" for message authentication. First the message is encrypted, then the ciphertext is authenticated, and the resulting MAC is appended to the ciphertext. This provides both ciphertext and plaintext integrity. The big advantage to this approach, is that if the MAC tag does not match the newly calculated MAC tag during verification, the ciphertext does not need to be decrypted. This is the default approach with IPsec and modern versions of OpenSSH. RFC 7366 standardizes this for TLS (yet to be implemented by OpenSSL last I checked). It is also part of the ISO/IEC 19772:2009 standard.

The next approach, MAC-then-encrypt, means authenticating the plaintext, appending the resulting MAC tag to the plaintext, and then encrypting the full plaintext and MAC tag payload. While this approach offers plaintext data integrity, it does not offer ciphertext integrity. As such, the ciphertext must be decrypted before the MAC tag can be verified. This is the default behavior with SSL/TLS, and therefore OpenSSL.

Finally, encrypt-and-MAC means authenticating the plaintext, and separately encrypting the plaintext. The resulting MAC tag is appended to the ciphertext. Again, like MAC-then-encrypt, this approach offers plaintext data integrity, but it does not offer ciphertext integrity. So, you must detach the MAC tag first, then decrypt the ciphertext, then verify if the MAC tag is valid. This is the default behavior in older versions of OpenSSH.

~~As I understand it, there are no known vulnerabilities with MAC-then-encrypt and encrypt-and-MAC MACs.~~ However, having both ciphertext and plaintext integrity with encrypt-then-MAC, as well as not needing to decrypt the ciphertext on failure, is why encrypt-then-MAC is the preferred way to handle message authentication.

EDIT: If the symmetric encryption algorithm is vulnerable to a padding oracle attack, then due to the nature of encrypt-and-MAC, this authentication scheme is also vulnerable. This lies in the problem that only the plaintext is authenticated, so as a recipient, you cannot detect a modified ciphertext. Encrypt-then-MAC is the only way to avoid "cryptographic doom".

As with digital signatures, MACs should be calculated with cryptographically secure hashing functions, such as RIPEMD160, SHA-2, SHA-3, BLAKE2, Skein, etc. MD5 and SHA-1 would not qualify (although we could get into a discussion about HMAC-MD5 and HMAC-SHA1, but we won't).

Conclusion

No doubt, it's confusing to separate checksums from digital signatures from message authentication codes. Things get even a bit more hairy with blind signatures (used primarily in digital currencies) and Merkle trees (used primarily in peer-to-peer networks and copy-on-write filesystems), but they're special cases of the primary three functions discussed above. However, if you can get checksums, digital signatures, and message authentication codes cleared up, then you're that much closer to implementing cryptographic protocols correctly.

Bitcoin Mining Rate and Waste

Aaron Toponce — Sat, 30 Jan 2016 13:27:01 +0000

Recently, the Bitcoin mining rate surpassed 1 exahash per second, or 1 quintillion SHA-256 hashes per second.

If we do some quick math, we can determine the following:

If SHA-1 collisions can be found in 2^65.3 hashes, that's one SHA-1 collision found every 45 seconds.

Every key in an 84-bit keyspace can be tried every year.

If mining is done strictly with ASICs and each ASIC can produce 1 trillion hashes per second, that's 1,000,000 ASICs.

If each ASIC above consumes 650 Watts of power, that's 650 Megawatts of power consumed.

At 650,000 kWh consumed every hour across all of the ASICs, that's 1.3 million pounds of CO2 released into the atmosphere every hour if using fossil fuels.

Current global rate is about 160 Bitcoins mined per hour.

At $0.15 USD per kWh, that's $609 spent on electricity per Bitcoin mined. Bitcoin is currently trading at $376/BTC.

That's "back of the envelope" calculations, with some big assumptions made about the mining operation (how it's powered, who is powering it, etc.).
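For anyone who wants to check the arithmetic, here is the same back-of-the-envelope math as a small Python sketch. Every input is taken from the list above, except the figure of 2 pounds of CO2 per kWh, which is an assumption backed out from the 1.3 million pounds number rather than something stated in this post.

```python
import math

HASH_RATE = 1e18                        # hashes per second (1 exahash/s)
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Time to perform 2^65.3 hashes, the assumed cost of a SHA-1 collision.
sha1_collision_seconds = 2 ** 65.3 / HASH_RATE

# Size, in bits, of the keyspace that can be enumerated in one year.
keyspace_bits_per_year = math.log2(HASH_RATE * SECONDS_PER_YEAR)

asics = HASH_RATE / 1e12                # 1 trillion hashes/s per ASIC
power_mw = asics * 650 / 1e6            # 650 W per ASIC, in megawatts
kwh_per_hour = power_mw * 1000          # energy used by all ASICs in one hour

LBS_CO2_PER_KWH = 2.0                   # ASSUMPTION: implied by the figures above
co2_lbs_per_hour = kwh_per_hour * LBS_CO2_PER_KWH

# $0.15 per kWh, spread over roughly 160 bitcoins mined per hour.
cost_per_btc = kwh_per_hour * 0.15 / 160

print(f"SHA-1 collision every {sha1_collision_seconds:.1f} s")
print(f"{keyspace_bits_per_year:.1f}-bit keyspace per year")
print(f"{asics:,.0f} ASICs drawing {power_mw:,.0f} MW")
print(f"{co2_lbs_per_hour:,.0f} lbs of CO2 per hour")
print(f"${cost_per_btc:,.2f} of electricity per bitcoin")
```

Changing any one assumption (hash rate per ASIC, wattage, electricity price) and rerunning it shows how sensitive the per-bitcoin cost is to that input.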

Of course, not all mining is using fossil fuels, not all miners are using ASICs, not all ASICs can do 1 trillion hashes per second (some more, some less), not all ASICs are consuming that wattage per rate, and the cost of electricity was strictly a U.S. figure. Of course, if you're using a GPU, or worse, a CPU for your mining, then you're expending more electricity per hash than ASICs are. That may help balance out some of the miners who are using renewable energy, such as solar power for their mining. Many Chinese and Russian mining data centers certainly have lower overhead costs on electricity. You get the point: we've made some big assumptions to come to some very rough "ballpark" figures. I don't think we're too far off.

So, according to those numbers, unless you're using renewable energy or cheaper electricity, or Bitcoin trading goes north of $610 USD/BTC, mining for Bitcoin is a net loss. This comes at the expense of 1.3 million pounds of CO2 released into the atmosphere every hour. I would argue that Bitcoin is the worst idea to come out of Computer Science in the history of mankind.

Using Your Monitors As A Cryptographically Secure Pseudorandom Number Generator

Aaron Toponce — Thu, 21 Jan 2016 12:55:26 +0000

File this under the "I'm bored and have nothing better to do" category. While coming into work this morning, I was curious if I could use my monitors as a cryptographically secure pseudorandom number generator (CSPRNG). I don't know what use this would have, if any, as your GNU/Linux operating system already ships a CSPRNG with /dev/urandom. So, in reality, there is really no need to write a userspace CSPRNG. But what the hell, let's give it a try anywho.

The "cryptographically secure" piece of this will come from the SHA-512 function. Basically, the idea is this:

Take a screenshot of your monitors.

Take the SHA-512 of that screenshot.

Resize the screenshot to 10% of its original size.

Take the SHA-512 of that resized file.

Take the SHA-512 of your previous two SHA-512 digests.

Take the last n-bits of that final digest as your random number.

Most GNU/Linux systems come with ImageMagick pre-installed, as well as the "sha512sum(1)" utility. So thankfully, we won't need to install any software. Here's a simple shell script that can achieve our goals:

#!/bin/sh
# Produces random numbers in the range of [0, 65535].
# Not licensed. Released to the public domain.

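# Work in an encrypted directory so the screenshots are never stored in the clear.
cd ~/Private || exit 1 # assuming you have an encrypted filesystem mounted here

# Full-size screenshot; the nanosecond timestamp doubles as the filename.
TS1=$(date +%Y%m%d%H%M%S%N)
import -window root "${TS1}.png"
sha512sum "${TS1}.png" > /tmp/SHA512SUMS

# Scaled-down copy of the same screenshot.
TS2=$(date +%Y%m%d%H%M%S%N)
convert -scale 10% "${TS1}.png" "${TS2}.png"
sha512sum "${TS2}.png" >> /tmp/SHA512SUMS

# Hash both digest lines, then convert the last 4 hex characters to decimal.
DIGEST=$(sha512sum /tmp/SHA512SUMS)
printf "%d\n" "0x$(printf '%s' "$DIGEST" | awk '{print substr($1, 125, 4)}')"

# Overwrite the temporary files before deleting them.
shred "${TS1}.png" "${TS2}.png"
rm "${TS1}.png" "${TS2}.png"
shred /tmp/SHA512SUMS
rm /tmp/SHA512SUMS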

Running it for 10 random numbers:

$ for i in {1..10}; do sh monitor-csprng.sh; done
15750
36480
64651
7942
2367
10905
53889
9346
52726
63570

A couple things to note:

This is slow, due to taking the screenshot, and resizing it.

The data on your monitors should be sufficiently random. Chats, social updates, etc. The security of this will depend entirely on the entropy of the initial screenshot.

You really should be saving your screenshots to an encrypted filesystem, such as eCryptfs.

We're using timestamps with nanosecond accuracy to provide some additional entropy for the final SHA-512 digest.

The script converts the last 4 hexadecimal characters of the final digest to decimal. In reality, it could be anything, including some convoluted dynamic search algorithm in the string.

It's worth noting that the entropy of the initial screenshot is critical, which is actually difficult to accurately measure. So, it may help to have a chat window or more open, with recent chat logs. Same could be said for social update "walls", with the most recent updates (Twitter, Facebook, Goodreads, etc.). Having a clock with seconds ticking in a status bar can also help (although not unpredictable, at least semi-unique). Tabs in browsers, running applications, etc. The more unpredictable your workspace in the screenshot, the better off you'll be. But, people in general suck at randomness, so I'm not advocating this as something you should rely on for a cryptographically secure random number generator.
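For reference, the digest chain from the steps above can be sketched in Python. This is only an approximation of the shell script: it takes the two images as byte strings rather than capturing a screenshot, and the file names passed in are hypothetical stand-ins for the timestamped names the script generates.

```python
import hashlib


def monitor_random(screenshot: bytes, resized: bytes,
                   name1: str = "ts1.png", name2: str = "ts2.png") -> int:
    """Return a number in [0, 65535] derived from two image byte strings."""
    d1 = hashlib.sha512(screenshot).hexdigest()
    d2 = hashlib.sha512(resized).hexdigest()
    # Mimic the two-line file that sha512sum writes: "<digest>  <filename>".
    # The file names are part of the hashed data, just as in the script.
    sums = f"{d1}  {name1}\n{d2}  {name2}\n".encode()
    final = hashlib.sha512(sums).hexdigest()
    # The last 4 hex characters are 16 bits, hence the [0, 65535] range.
    return int(final[-4:], 16)
```

The function is deterministic: the same two inputs and file names always give the same number, which is exactly why the unpredictability of the screenshot matters so much.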

If you wanted, you could add this to a terminal, giving you a sort of "disco rave" before taking the screenshot:

#!/bin/sh
# Disco lights in your terminal
# No license. Released to the public domain

while true; do
    # Print a bullet in a random color; \033 is the escape character.
    printf "\033[38;5;$(($(od -d -N 2 -A n /dev/urandom)%$(tput colors)))m•\033[0m"
done

Get that running first, then take your screenshot. But then, if you're reading data off of /dev/urandom, you might as well do that for your random numbers anyway...

Disable Pocket From Iceweasel

Aaron Toponce — Sat, 02 Jan 2016 14:20:52 +0000

I'm not sure who I should be more disappointed in- Mozilla or Debian. Iceweasel 43 recently arrived in Debian unstable, and with it, Pocket. For those who are not familiar, Pocket is a 3rd party service that allows users to save sites they want to read or visit for later. Provided the extension is installed, this allows users to sync pages they want to read for later, across devices and platforms.

But here's the catch: it's a proprietary non-free service-as-a-software-substitute (SaaSS).

Thankfully, you can disable it, and it really isn't that difficult. Open up about:config in a new tab, and type "pocket" into the search filter. From there, set "browser.pocket.api" and "browser.pocket.site" to "localhost", and set "browser.pocket.enabled" to "false", then restart your browser.

It really bothers me that Mozilla has enabled this sort of integration into their browser. Not only Pocket, but other proprietary or privacy invasive plugins and extensions also, such as "sponsored tiles" (which is finally removed), "encrypted media extensions", and "Hello" (which I haven't figured out how to disable). These sorts of things should be separate extensions or plugins that the user can install at their whim. Shipping it by default takes away freedom and choice, and it's turning the browser into a proprietary non-free software application.

What ultimately bothers me about this, is that Mozilla already has bookmark synchronization support, and their sync server is Free Software, allowing you to roll your own. Pocket doesn't offer anything that Mozilla Sync doesn't. I already have a "TOREAD" bookmark folder, where I can put pages I want to read later. And it's synched across all of my devices.

Mozilla pushing the 3rd party proprietary Pocket, and Debian shipping it in Iceweasel (thankfully, a bug is submitted) is a great disservice to users and a threat to software freedom.

Hopefully, Pocket goes the way of sponsored tiles, and gets removed.

Encrypted Account Passwords with Vim and GnuPG

Aaron Toponce — Thu, 31 Dec 2015 12:40:05 +0000

Background

I've been a long-time KeepassX user, and to be honest, I don't see that changing any time soon. I currently have my password database on an SSH-accessible server, on which I use kpcli as the main client for accessing the db. I use Keepass2Android with SFTP on my phone to get read-only access to the db, and I use sshfs mounts on my workstations with KeepassX for read-only GUI access. It works great, and allows me to securely access my password databases from any client, mobile or otherwise.

However, I recently stumbled on this post on how to use Vim with GnuPG to create an encrypted file of passwords: http://pig-monkey.com/2013/04/password-management-vim-gnupg/. I've heard about a GnuPG plugin for Vim for years now, and know friends that use it. I've even recommended that others use it as a simplistic means of keeping an encrypted password database, instead of relying on 3rd-party tools. However, I've never really used it myself. Well, after reading that post, I decided to give it a try.

Defining a specification

Ultimately, everything in that post I'm carrying over here, with only a couple modifications. First, fields should end with a colon, including the comments field. Comments could be just a single line, or multi-line, but it's still a field just as much as "user" or "pass". Further, there should be a little flexibility in the field keywords, such as "user" or "username". Additionally, because I exported my Keepass db to an XML file, then used a Python script to convert it into this syntax, I also carried over some additional fields. So, I've defined my database with the following possible fields:

comment|comments

expire|expires

pass|password

tag|tags

type

url

user|username

Notice that I did not define a "title" as would be the case in the Keepass XML. The entry itself is the title, so I find this redundant. Also, you'll notice I defined an additional "type" field. While not explicitly defined in the Keepass XML, it is implicitly defined with icons for entries. This could be useful for distinguishing "ssh" vs "mysql" vs "ldap" vs "http" authentications when searching in the file.

So, an invalid example on pig-monkey.com is:

Super Ecommerce{{{
    user: foobar
    pass: g0d
    Comments{{{
        birthday: 1/1/1911
        first car: delorean
    }}}
}}}

This is invalid due to the "Comments" field. Fixed would be:

Super Ecommerce{{{
    user: foobar
    pass: g0d
    Comments:{{{
        birthday: 1/1/1911
        first car: delorean
    }}}
}}}

Another valid entry could be:

Example {{{
    username: aarontoponce
    password: toomanysecrets
    url: https://example.com
    type: http
    tags: internet,social,2fa
    comments: {{{
        backup codes: vbrd83ezn2rjeyj, p89r4zdpjmyys2k, rdh6e7ubz8vh82g, er4ug6vp25xsgn5
        2fa-key: "3udw mkmm uszh cw2a 5agm 7c3p 5x32 tyqz"
    }}}
}}}

Notice that I have not defined file comments, such as those found in configuration files or source code. There is a comment section per entry, so that seems to be the fitting place for any and all comments.
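To show how the specification behaves, here is a hedged Python sketch of a parser for it. It is not the conversion script mentioned earlier, just an illustration. It assumes one field per line, the Vim fold markers "{{{" and "}}}" as delimiters, and a multi-line comments field opened with "comments: {{{"; the name "parse_entries" is made up for this example.

```python
import re
from typing import Dict

# The field keywords from the list above, each followed by a colon.
FIELD = re.compile(
    r"^(comments?|expires?|pass(?:word)?|tags?|type|url|user(?:name)?):\s*(.*)$",
    re.IGNORECASE,
)


def parse_entries(text: str) -> Dict[str, Dict[str, str]]:
    """Parse fold-marker delimited entries into {title: {field: value}}."""
    entries: Dict[str, Dict[str, str]] = {}
    title = None          # entry currently being read, if any
    in_comment = False    # inside a multi-line comments block
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "}}}":
            if in_comment:
                in_comment = False   # closes the comments block
            else:
                title = None         # closes the entry
            continue
        if title is None:
            if line.endswith("{{{"):
                title = line[:-3].strip()
                entries[title] = {}
            continue
        if in_comment:
            entries[title]["comments"] += line + "\n"
            continue
        match = FIELD.match(line)
        if match is None:
            raise ValueError(f"invalid field in {title!r}: {line!r}")
        key, value = match.group(1).lower(), match.group(2)
        if key.startswith("comment") and value == "{{{":
            in_comment = True
            entries[title]["comments"] = ""
        else:
            entries[title][key] = value
    return entries
```

With this sketch, the "Comments{{{" line from the invalid example raises a ValueError because it lacks the colon, while the fixed "Comments:{{{" version parses cleanly.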

I really liked the post, and how thought out the whole thing was, including automatically closing the PGP file after an inactivity timeout, automatically folding entries to prevent shoulder surfing, and clearing the clipboard when Vim closes. However, one oversight that bothered me, was not concealing the actual password when the entry is expanded. Thankfully, Vim supports syntax highlighting. So, we just need to define a filetype for GnuPG encrypted accounts, and define syntax rules.

Vim syntax highlighting

EDITED TO ADD: I tried getting the Vim syntax working in this post, but WordPress is clobbering it. So, you'll need to get it from pastebin instead. Sorry.

To get this working, we need a syntax file. I don't know if one exists already for this syntax structure, but it isn't too difficult to define one. Let's look at what I've defined in this pastebin, then I'll go over it line-by-line.

The first four lines in the syntax file are just comments. Next is just a simple if-statement checking if syntax highlighting is enabled. If so, use it. The first interesting line is the following:

let b:current_syntax = "gpgpass"

This defines our syntax. Whenever we load a file with syntax highlighting enabled, and we set the "filetype" to "gpgpass", this syntax will be applied.

syntax case ignore

This just allows us to have "comment" or "Comment" or "COMMENT" or any variations on the letter case, while still matching and providing a highlight for the match.

After that, we get into the meat of it. This "syntax match" section allows me to conceal the passwords on the terminal to prevent shoulder surfing, even when the entry is expanded. This is done by setting the background terminal color to "red" and the foreground text color also to "red". Thus, we have red text on a red background. The text is still yankable and copyable, even with the mouse cursor; it's just not visible on screen.

The actual concealment is done with the regular expression. An atom is created to match "pass:" or "password:" surrounded by whitespace as the first word on the line. However, I don't want to conceal the actual text "pass:", just the password itself. So, the regular expression "\@<=" says to ignore our atom in the match, and only match "\S\+" for concealing. The concealment is achieved with red foreground text on a red background with:

highlight gpgpassPasswords ctermbg=red ctermfg=red

The rest of the syntax matching in that pastebin is for identifying our fields, and highlighting them as a "Keyword" using regular expressions. All field names will be highlighted the same color based on your colorscheme, as they are all defined the same. Thus, aside from the hidden password, there is uniformity and elegance in the presentation of the syntax.
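The matching idea behind the password concealment can be illustrated outside of Vim. The Python sketch below is not the Vim syntax file, and its pattern is only an assumption about what the description above amounts to. Python's "re" module requires fixed-width lookbehinds, so a capture group stands in for Vim's "\@<=": the "pass:" or "password:" keyword is matched but left alone, and only the non-whitespace run after it is masked.

```python
import re

# Keyword as the first word on the line, then the password as group 1.
PASSWORD = re.compile(r"^\s*pass(?:word)?:\s+(\S+)", re.IGNORECASE | re.MULTILINE)


def conceal(text: str, mask: str = "*") -> str:
    """Replace each password with mask characters, keeping the field name."""
    def _mask(match: "re.Match[str]") -> str:
        start, end = match.span(1)
        prefix = match.group(0)[: start - match.start()]  # e.g. "pass: "
        return prefix + mask * (end - start)
    return PASSWORD.sub(_mask, text)
```

For example, conceal("pass: g0d") gives "pass: ***", while a "user: foobar" line passes through untouched.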

Using the syntax in Vim

This syntax file won't do us much good if it isn't installed and Vim isn't configured to use it. We could save it system-wide to "/usr/share/vim/vim74/syntax/gpgpass.vim", or just keep it in our home directory at "~/.vim/syntax/gpgpass.vim". Whatever works.

Now that the syntax file is installed, we need to call it when editing or viewing GnuPG password files. We can use the vimrc from pig-monkey.com, with one addition: we're going to add "set filetype=gpgpass" under the "SetGPGOptions()" function. Now, I understand that you may edit encrypted files that are not GnuPG password files. So, you're going to get syntax highlighting in those cases. Or, you could enable the modeline and set a modeline in the password file. The problem with the modeline is its long history of vulnerabilities. Most distributions, including Debian, disable it, and for good reason too. So, I'd rather have it set here, and unset the "filetype" if it's bothering me.

Here's the relevant config:

if has("autocmd")
    """"""""""""""""""""
    " GnuPG Extensions "
    """"""""""""""""""""

    " Tell the GnuPG plugin to armor new files.
    let g:GPGPreferArmor=1

    " Tell the GnuPG plugin to sign new files.
    let g:GPGPreferSign=1

    augroup GnuPGExtra
        " Set extra file options.
        autocmd BufReadCmd,FileReadCmd *.\(gpg\|asc\|pgp\) call SetGPGOptions()
        " Automatically close unmodified files after inactivity.
        autocmd CursorHold *.\(gpg\|asc\|pgp\) quit
    augroup END

    function SetGPGOptions()
        " Set the filetype for syntax highlighting.
        set filetype=gpgpass
        " Set updatetime to 1 minute.
        set updatetime=60000
        " Fold at markers.
        set foldmethod=marker
        " Automatically close all folds.
        set foldclose=all
        " Only open folds with insert commands.
        set foldopen=insert
    endfunction
endif " has ("autocmd")

Conclusion

What I like about this setup is the portability and simplicity. I am in a terminal on a GNU/Linux box most of my waking hours. It makes sense to use tools that I already enjoy, without needing to rely on 3rd party tools. This also closes the gap of potential bugs with 3rd party password managers leaking my passwords. I'm not saying that Vim and GnuPG won't be vulnerable, of course, but I do place more trust in these tools than the Keepass ones, to be honest.

As of right now, however, I am still a Keepass user. But, I wanted to put this together, and try it out for size, and see how the shoe fits. As such, I've exported my KeepassX database, encrypted it with GnuPG, configured Vim, and I'm off to the races. I'll give this a go for a few months, and see how I like it. I know it's going to pose issues for me on my phone, even with ConnectBot and SSH keys. But, maybe I don't need it on my phone anyway. Time will tell.

Oh, and I can still view the database as read-only and still enjoy the syntax highlighting benefits by using "view /path/to/passwords.gpg" instead of "vim /path/to/passwords.gpg".

Multiple Encryption

Aaron Toponce — Sat, 26 Dec 2015 14:37:37 +0000

I hang out in ##crypto in Freenode, and every now and then, someone will ask about the security of multiple encryption, usually with the context that AES could be broken in the near future. When talking about multiple encryption, they are usually referring to cascade encryption which has the form of:

CT = Alg_B(Alg_A(M, key_A), key_B)

The discussion revolves around the differences between "Alg_A" and "Alg_B", such as using AES for "Alg_A" and Camellia for "Alg_B". Also, the discussion will include whether or not "key_A" and "key_B" should be the same key, or different.

Cascade encryption is more efficient in storage space than some alternatives, such as this one suggested by Bruce Schneier:

CT = Alg_A(OTP, key_A) || Alg_B(XOR(M, OTP), key_B), where OTP is a true one-time pad
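The storage difference between the two constructions is easy to see in code. The Python sketch below uses a made-up "toy_cipher" (an XOR keystream built from SHA-256) in place of real algorithms like AES and Camellia; it is not secure and only exists so the example runs with the standard library. The labels b"A" and b"B" are a stand-in for "two different algorithms".

```python
import hashlib
import secrets


def toy_cipher(label: bytes, key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher (NOT secure). Applying it twice undoes it."""
    out = bytearray()
    for i in range(0, len(data), 32):
        pad = hashlib.sha256(label + key + (i // 32).to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[i:i + 32], pad))
    return bytes(out)


def cascade_encrypt(m: bytes, key_a: bytes, key_b: bytes) -> bytes:
    """CT = Alg_B(Alg_A(M, key_A), key_B): same length as the message."""
    return toy_cipher(b"B", key_b, toy_cipher(b"A", key_a, m))


def cascade_decrypt(ct: bytes, key_a: bytes, key_b: bytes) -> bytes:
    return toy_cipher(b"A", key_a, toy_cipher(b"B", key_b, ct))


def split_encrypt(m: bytes, key_a: bytes, key_b: bytes) -> bytes:
    """CT = Alg_A(OTP, key_A) || Alg_B(XOR(M, OTP), key_B): twice the length."""
    otp = secrets.token_bytes(len(m))              # fresh random pad per message
    masked = bytes(x ^ y for x, y in zip(m, otp))
    return toy_cipher(b"A", key_a, otp) + toy_cipher(b"B", key_b, masked)


def split_decrypt(ct: bytes, key_a: bytes, key_b: bytes) -> bytes:
    half = len(ct) // 2
    otp = toy_cipher(b"A", key_a, ct[:half])
    masked = toy_cipher(b"B", key_b, ct[half:])
    return bytes(x ^ y for x, y in zip(otp, masked))
```

For a message of n bytes, the cascade produces n bytes of ciphertext while the pad-splitting construction produces 2n, since the encrypted pad has to be stored alongside the masked message.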

I'm not going to go into the theoretical concerns with multiple encryption. However, I would like to cover some practical considerations:

Multiple key security.

Long-term storage.

Complexity.

Host security.

Multiple key security

It should come as no surprise that when dealing with multiple encryption, you are going to be dealing with multiple keys, if you choose to keep "key_A" and "key_B" separate. Probably the most difficult aspect of encryption implementations is keeping the secret key secret. For example, key exchanges between machines over the scary Internet have been notoriously difficult to get correct. Current best practice is implementing authenticated ephemeral elliptic-curve Diffie-Hellman (ECDHE) when communicating secret symmetric keys between machines. So, not only do you need to communicate one key, but multiple keys when encrypting and decrypting data.

If the multiple-encrypted data is to be stored on disk, then keys will need to be retrieved for later. How are these stored? This isn't an easy question to answer. If you store them in a password manager, they are likely just getting single-encrypted, probably with AES. So, the security of your ciphertext rests on the security of your stored keys, likely protected by the very algorithm you are trying to safe-guard.

Now, you could use the same key for every encryption layer. But, this poses a theoretical concern (which I promised I wouldn't cover- sorry). If the same key is used for every layer, then if an attacker can recover the key through cryptanalysis of the first encryption layer, then the attacker could possibly decrypt all remaining layers. Obviously, you don't want to use ciphers where the decryption process is exactly the same as the encryption process. Otherwise, the second encryption process on the ciphertext would decrypt the first encryption! While not probable, this last scenario could even occur with different algorithms, such as AES and Camellia. So, it seems at least at a cursory glance, that using the same key for all encryption layers probably is not a wise idea. So, we're back to key management, which is the bane of cryptographers everywhere.

Long-term storage

In my opinion, a larger problem is that of storage. It's one thing to get multiple encryption correct and safe on the wire, it's another to place value on long-term data storage. Think about it for a second- what is the longest you have kept data on the same drive? In personal scenarios, I have some friends that have had personal backups for up to five years. To me, this is impressive. It's likely more common that data switches drives every couple of years. RAID arrays die, hardware is replaced, higher drive capacity is demanded, or even bit rot creeps in, destroying data (such as on magnetic or optical mediums). When push comes to shove, the encrypted data is just going to move from drive-to-drive. But, ask yourself this next question- what is the oldest data you have in your possession right now?

Let's be realistic here for a second. I would be hard-pressed to find data I stored back in 2000, 15 years ago. I could find some photos in photo albums, on mugs, and on Christmas cards, but I'm not 100% confident I could get the digital original. Despite my best efforts, accidents happen, mistakes are made, and data is just lost. I don't think I'm alone here. I've even worked for companies, with large budgets, that had a hard time recovering data that is 10+ years old. For one, it's expensive to hold on to data indefinitely, but a great amount of data also becomes less and less valuable as time progresses. Yes, I still use my same email account from 2004- Google has done a great job of keeping all of my emails these past 11 years, and I would expect other data service providers to do the same. But, how many of you have kept an email address for 10+ years? Or even the data, for that matter? (This blog is actually 11 years old as well- kudos to me on keeping the data going this long).

My point is, hardware fails and is changed. Your personal value on data also changes, and accidents happen. So you're concerned about AES being broken in 20 years, or even sooner. Do you think by that time you'll still place value on that encrypted data? Do you think you'll even still have access to it, or can find it? And, if so, will it really be that difficult to decrypt the AES data, and encrypt it with the current best practice encryption algorithm?

Complexity

This is probably the problem you should be concerned with the most. As a collective group, we as developers have a hard time getting single encryption correct, let alone multiple encryption. This deeply enters the theoretical realm, which I promised I wouldn't blog about. But, you do have a practical concern as well- order of operations and correct implementations.

First, order of operations. It's one thing to do "double encryption", where only two algorithms are chosen and used. If you can't recall if you used AES first or second, it's a 50/50 shot at getting the order correct (provided you know which key belongs to which algorithm, otherwise it's a one-in-four chance). Imagine however, using three encryption layers, and lining the keys up correctly. Imagine the complexity of four layers, or more. Ugh. Seems like you certainly don't want to go higher than two layers.
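To put numbers on that, here is a small Python sketch that counts the possibilities. It assumes every layer uses a distinct algorithm and a distinct key, so there are n! possible layer orderings, and another n! possible key-to-algorithm assignments if the keys aren't labeled. The function name is made up for this example.

```python
from math import factorial


def decryption_guesses(layers: int, keys_labelled: bool) -> int:
    """Number of (ordering, key assignment) combinations to try in the worst case."""
    orderings = factorial(layers)
    key_assignments = 1 if keys_labelled else factorial(layers)
    return orderings * key_assignments


for n in (2, 3, 4):
    print(n, decryption_guesses(n, True), decryption_guesses(n, False))
```

Two layers give the 50/50 and one-in-four odds mentioned above; three layers jump to 6 and 36 combinations, and four layers to 24 and 576.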

Second, look at implementations. AES is AES. It shouldn't matter what implementation does the calculations. But, implementations like to put "magic bytes" at the beginning of ciphertexts (OpenSSL, OpenPGP, etc.). This data is only valuable for that implementation, and even worse, for a specific subset of versions. Just imagine encrypting a file with OpenSSL version 1.0 now, and needing to decrypt it in 10 years. Will OpenSSL version X be able to read those magic bytes, and correctly decrypt the file? Or will it error out, unable to decrypt the data because the data structure of the magic bytes changed in that 10 year time frame?

So, it seems best to encrypt it with some programming language library, where you can control exactly what data is stored. But, as everyone will tell you while frothing at the mouth, "don't roll your own crypto". Technically, you aren't if you "import aes" and use the "aes" module provided by that language correctly. It just remains to be seen if you implemented it correctly to thwart an attacker. Crypto is hard and full of sharp edges. It's very difficult to get things right, without getting cut. Regardless, while the "aes" module might be available in 10 years, what about the "camellia" module, or whatever algorithm you chose for the second layer? Is it still in development, or was it abandoned due to either being broken, or lack of development? Can you find that module, so you can decrypt your data?

Host Security

In a more practical real-world, everyday person scenario, how secure is the host that is doing the multiple encryption? Do others have physical access to the machine? Is it free of viruses, malware, and other badware? Does the system run an encrypted filesystem? Where and how are backups stored? Who has access to those backups? So many more questions can be asked that judge the quality of the security level of the host storing or processing the data.

Viruses and malware would probably be my number one concern if the data was so valuable, as to be multiple-encrypted. So, I would probably encrypt the plaintext on one machine, encrypt the ciphertext on a second machine, and store it on a third machine, preferably air-gapped. Thus, if a virus exists on one machine, hopefully it doesn't exist on another, and hopefully it doesn't attach itself to my encrypted data, and hopefully the badware didn't report my plaintext to a botnet pre-encryption.

Physical host security is hard. People have crappy passwords protecting their workstations. Physical access can get the attacker root regardless. Systems are infected with badware all the time, just by visiting websites! So there is hardly a guarantee that your data is safe, even though it was encrypted multiple times with different keys and algorithms.

A Couple Thoughts

It hardly seems worth the effort to encrypt your data multiple times with different algorithms and different keys, given the overhead necessary in managing everything (hardware and software). Further, in reality, modern encryption algorithms aren't usually broken. For example, DES as an algorithm isn't broken; it just has a small key space. So, encrypting your data multiple times is solving a problem that, for the most part, just doesn't exist.

That's not to say that AES will remain secure in 10, 20, or 40 years. I'm not that naive. But, as a user, you do have the ability to switch algorithms when AES does break. So, decrypt your AES ciphertext, and encrypt it with SevenFish (sorry Bruce- bad joke). Keep it encrypted with SevenFish until that breaks, and then decrypt it, and encrypt it with whatever the new modern cipher is at the time (if you still have the data, it's still valuable to you, and all implementations can still work with the ciphertext).

Conclusion

In my opinion, don't worry about multiple encryption. Generate a GnuPG key pair, encrypt your data once, and be done with it.

Getting Root On The Nexus 6 With Android 6

Aaron Toponce — Tue, 22 Dec 2015 21:25:45 +0000

This is probably the 40 millionth time, since owning this phone, that I've needed to root my device. Because I keep doing it over and over, while also referring to past commands and notes, it's high time I blogged the steps. If I can benefit myself from my own blog post, then chances are someone else can. So, with that said, here's what we're going to do:

Grab the latest Nexus factory images from Google.

Update the phone by flashing all the images (without wiping user data).

Flash the recovery with the latest TWRP image.

Get root on the device with Chainfire's "system-less root" SuperSU package.

Enable USB tethering and the wireless hotspot functionality.

Before beginning, I should mention that if the title isn't immediately clear, this post is specific to the Motorola Nexus 6, which is the phone I currently own. It's probably generic enough, however, to be applied to a few Nexus devices. Minus getting the factory Nexus images from Google, this might even be generic enough for non-Nexus devices, but you're on your own there. Proceed at your own risk. With that said, it's fairly hard to brick an Android phone these days.

Also, you need to make sure you have an unlocked bootloader. Google ships with the bootloader locked by default. Unlocking it will wipe your user partition, meaning you will lose any and all user data (images, videos, text messages, application data, etc.). I'm going to assume that you've already unlocked the bootloader, and are ready to proceed.

TL;DR

If you don't want to read the post, and know what you're doing, here's the short of it:

$ tar -xf shamu-mmb29k-factory-9a76896b.tgz
$ cd shamu-mmb29k
$ adb reboot bootloader
$ fastboot flash bootloader bootloader-shamu-moto-apq8084-71.15.img
$ fastboot reboot-bootloader
$ fastboot flash radio radio-shamu-d4.01-9625-05.32+fsg-9625-02.109.img
$ fastboot reboot-bootloader
$ fastboot update image-shamu-mmb29k.zip
$ fastboot flash recovery twrp-2.8.7.1-shamu.img
$ fastboot reboot recovery
(reboot normally)
$ adb push UPDATE-SuperSU-v2.46.zip /sdcard/supersu.zip
$ adb reboot recovery
(install /sdcard/supersu.zip from TWRP)
(do not install TWRP root)
(reboot normally)
(install build.prop editor from Google Play)
(set "net.tethering.noprovisioning" to "true")

Otherwise ...

Getting the Google Nexus factory images

Navigate to https://developers.google.com/android/nexus/images#shamu and grab the version you are looking for. For example, I recently wanted to flash 6.0.1, so I grabbed the "MMB29K" image. Before flashing, I find it critical to verify the checksums. They are "27dde1258ccbcbdd3451d7751ab0259d" for MD5 and "9a76896bed0a0145dc71ff14c55f0a590b83525d" for SHA-1. So, after downloading, I pulled up a terminal, and verified them:

$ md5sum shamu-mmb29k-factory-9a76896b.tgz
27dde1258ccbcbdd3451d7751ab0259d  shamu-mmb29k-factory-9a76896b.tgz
$ sha1sum shamu-mmb29k-factory-9a76896b.tgz
9a76896bed0a0145dc71ff14c55f0a590b83525d  shamu-mmb29k-factory-9a76896b.tgz

After examination, it's clear these checksums match, so I'm ready to flash.
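Comparing long hex strings by eye is easy to get wrong, so the comparison can be left to the shell. What follows is only a minimal sketch, assuming the coreutils md5sum(1) and sha1sum(1) tools are installed; the function name "verify_sum" is made up for illustration and is not part of any package:

```shell
# verify_sum ALGO FILE EXPECTED   (hypothetical helper, not a standard tool)
# ALGO is the name of a coreutils checksum program, e.g. md5sum or sha1sum.
# Returns 0 when the computed digest equals EXPECTED, 1 otherwise.
verify_sum() {
    algo=$1 file=$2 expected=$3
    # The tools print "<digest>  <filename>"; keep only the digest.
    actual=$("$algo" "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file ($algo)"
    else
        echo "MISMATCH: $file ($algo)" >&2
        return 1
    fi
}
```

For the image above, that would be "verify_sum sha1sum shamu-mmb29k-factory-9a76896b.tgz 9a76896bed0a0145dc71ff14c55f0a590b83525d", and the same with md5sum and the MD5 value.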

Flashing the images

This step does not require root on your device. I'll need to connect my phone to my computer via USB, and verify that I can talk to it via adb(1). This means installing the Debian "android-tools-adb" and "android-tools-fastboot" packages if they're not already installed. Once they are, I should be able to verify that I can talk to the phone:

$ sudo apt-get install android-tools-adb android-tools-fastboot
(...snip...)
$ adb devices
List of devices attached
[serial number]    device
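If you would rather have a script check this than read the output yourself, the listing can be parsed. This is a rough sketch, assuming the usual "adb devices" layout of one header line followed by "serial state" pairs; the function name "count_ready_devices" is made up for illustration:

```shell
# count_ready_devices: read `adb devices` output on stdin and print how many
# entries are in the "device" state (as opposed to "unauthorized" or "offline").
count_ready_devices() {
    # Skip the "List of devices attached" header, then match on the state column.
    awk 'NR > 1 && $2 == "device" { n++ } END { print n + 0 }'
}

# Typical use: adb devices | count_ready_devices
```

Anything other than "1" means the phone is either not connected, not authorized for USB debugging, or not the only device attached.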

If your device is visible, we are ready to rock-n-roll. First, extract the tarball, and enter the directory:

$ tar -xf shamu-mmb29k-factory-9a76896b.tgz
$ cd shamu-mmb29k
$ ls -lh
total 2.3G
-rw-r--r-- 1 atoponce atoponce  124 Jan  1  2009 android-info.txt
-rw-r--r-- 1 atoponce atoponce 8.1M Jan  1  2009 boot.img
-rw-r----- 1 atoponce atoponce  11M Nov 18 16:59 bootloader-shamu-moto-apq8084-71.15.img
-rw-r--r-- 1 atoponce atoponce 6.2M Jan  1  2009 cache.img
-rw-r----- 1 atoponce atoponce  985 Nov 18 16:59 flash-all.bat
-rwxr-x--x 1 atoponce atoponce  856 Nov 18 16:59 flash-all.sh*
-rwxr-x--x 1 atoponce atoponce  814 Nov 18 16:59 flash-base.sh*
-rw-r----- 1 atoponce atoponce 964M Nov 18 16:59 image-shamu-mmb29k.zip
-rw-r----- 1 atoponce atoponce 113M Nov 18 16:59 radio-shamu-d4.01-9625-05.32+fsg-9625-02.109.img
-rw-r--r-- 1 atoponce atoponce 8.8M Jan  1  2009 recovery.img
-rw-r--r-- 1 atoponce atoponce 2.0G Jan  1  2009 system.img
-rw-r--r-- 1 atoponce atoponce 136M Jan  1  2009 userdata.img

Notice a couple of things- first, there are shell scripts "flash-all.sh" and "flash-base.sh" for Unix-like systems. Also, notice the "bootloader-shamu-moto-apq8084-71.15.img" & "radio-shamu-d4.01-9625-05.32+fsg-9625-02.109.img" raw images, as well as the "image-shamu-mmb29k.zip". These are the only files we're going to concern ourselves with when flashing the phone.

However, we want to be careful that we don't flash "userdata.img". This will format your user partition and all user data will be wiped (see above). What we're going to do, is basically the same execution as the "flash-all.sh" shell script. However, we're going to make just one small modification. Further, we need our phone already booted into the bootloader. As such, here's what we're going to do:

$ adb reboot bootloader
$ fastboot flash bootloader bootloader-shamu-moto-apq8084-71.15.img
$ fastboot reboot-bootloader
$ fastboot flash radio radio-shamu-d4.01-9625-05.32+fsg-9625-02.109.img
$ fastboot reboot-bootloader
$ fastboot update image-shamu-mmb29k.zip

Notice that I removed -w from that last command (if you looked in the "flash-all.sh" shell script). That option wipes user data, which would be necessary if we wanted to return the phone back to factory state. We don't- we're just upgrading. Also, I don't see the need for "sleep 5". Just wait for the phone to successfully reboot before running the next command.
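If you would rather keep a modified copy of the script than type the commands by hand, the one change can be made with sed(1). This is only a sketch: it assumes the line in "flash-all.sh" reads "fastboot -w update image-shamu-mmb29k.zip", which is what removing -w from the last command implies, and the function name "strip_wipe_flag" is made up for illustration:

```shell
# strip_wipe_flag: read a flashing script on stdin and print it with the
# "-w" (wipe user data) option removed from any "fastboot -w update ..." line.
# All other lines pass through unchanged.
strip_wipe_flag() {
    sed 's/^\([[:space:]]*fastboot\) -w \(update .*\)$/\1 \2/'
}

# Typical use: strip_wipe_flag < flash-all.sh > flash-keep-data.sh
```

Read the resulting file before running it. If the stock script is laid out differently than assumed here, the -w will still be in there.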

At this point, the phone is successfully updated. If you were to reboot the phone, it would be perfectly operational as if you did an OTA update, or purchased it from the store. However, we want root, so we have a few more steps to accomplish.

Getting and flashing TWRP

This step also does not require root on your phone. I prefer TWRP for my recovery on Android. It's touch-based, which sets the UI apart from the other recoveries, and it's Free Software, unlike ClockworkMod. Both of these are big wins for me. Grab the latest image at https://twrp.me/devices/motorolanexus6.html. I downloaded twrp-2.8.7.1-shamu.img. Unfortunately, I couldn't find any published checksums to verify the download. So, I installed it anyway, knowing I could flash the stock "recovery.img" if something went wrong. So far, things have been great, so I calculated the checksums for you:

$ md5sum twrp-2.8.7.1-shamu.img
f040c3a26f71dfce2f04339f62e162b8  twrp-2.8.7.1-shamu.img
$ sha1sum twrp-2.8.7.1-shamu.img
40017e584879fad2be4043c397067fe4d2d76c88  twrp-2.8.7.1-shamu.img
$ sha256sum twrp-2.8.7.1-shamu.img
ebe5af833e8b626e478b11feb99a566445d5686671dcbade17fe39c5ce8517c7  twrp-2.8.7.1-shamu.img

If those check out, you should be safe in flashing. The phone should still be booted into the bootloader at this point. If not, make sure it is. Once in the bootloader, we can flash TWRP:

$ fastboot flash recovery twrp-2.8.7.1-shamu.img

Now, it's critical that we don't normally reboot the phone. If we do, recovery will be overwritten, and we'll have to reflash. So, while your phone is still booted into the bootloader, reboot it into recovery. You can do this by pressing the volume up/down arrows, until rebooting into recovery is available, and pressing the power button. This should boot you into TWRP. Now that you're there, you can reboot the phone normally.

WARNING
It is possible that while booting, your phone will notify you that the system cannot be verified. One of two things will happen: either the boot will pause and not go further, or it will boot despite the warning. With these exact versions, my phone boots without the warning at all. However, don't panic if you see it. Remember, you have the factory images. Just reflash the recovery.img, and you will be just fine.

More info can be found at http://www.xda-developers.com/a-look-at-marshmallow-root-verity-complications/.

Getting and flashing SuperSU (getting root)

WARNING
At this point, the phone should be booted into its regular state. We are now ready to root the phone. This means getting the latest SuperSU package, and installing it through TWRP. However, I need to throw out another caution. We'll be installing a beta version of SuperSU to do something called "system-less root". This means that the package will only be modifying the boot image to get root, and will not be touching the system partition. This is both good, and bad. It's good in that we only need to reflash the boot image to remove root. It's bad in that this is experimental software, and really not ready for production. Further, unlike TWRP, SuperSU is proprietary software, which sucks. It does make me a bit nervous, to be honest, to rely on non-free closed-source proprietary software, on such a critical piece of my life. Proceed at your own risk.

As of this writing, you'll need to get the SuperSU package from the XDA forums at http://forum.xda-developers.com/showpost.php?p=64161125&postcount=3. I grabbed version "BETA-SuperSU-v2.64-20151220185127.zip". There may be updates since this post was published.

Unfortunately, again, I did not see any published checksums. So, I've installed it, with the knowledge of how to reflash my boot image should I encounter problems.

$ md5sum UPDATE-SuperSU-v2.46.zip
332de336aee7337954202475eeaea453  UPDATE-SuperSU-v2.46.zip
$ sha1sum UPDATE-SuperSU-v2.46.zip
6135f9d0af28e02f4292c324bf5983998e7ae006  UPDATE-SuperSU-v2.46.zip
$ sha256sum UPDATE-SuperSU-v2.46.zip
d44cdd09e99561132b2a4cd19d707f7126722a9c051dc23f065a948c7248dc4e  UPDATE-SuperSU-v2.46.zip

Provided these checksums match, we're good to go. We need to push the ZIP to our phone with the Android debugger, and reboot into the TWRP recovery:

$ adb push UPDATE-SuperSU-v2.46.zip /sdcard/supersu.zip
$ adb reboot recovery

From the TWRP interface, tap "Install" and install the /sdcard/supersu.zip package. When it finishes, tap "Reboot". TWRP will ask if you would like to install the root provided by the image. You do NOT want to install this root- you just flashed one.

The phone should boot normally.

Enable USB tethering and the wireless hotspot

This step requires root. Finally, we want to enable the hotspot and tethering. Google is bending to wireless carriers, forcing the user to prove that they are subscribing to a cellular service that allows them to use USB tethering or the wireless hotspot. Personally, I find this dirty, and unfortunate. Even worse, is the fact that cellular providers think they can get away by charging double for using your own data. Data is data; it shouldn't matter if it comes from your phone, or your laptop connected to your phone. If they want to charge for overages on caps, whatever. But charging double, just because you connected your phone via USB? Or setting up a hotspot in your grandma's house, because she doesn't have WiFi but you have cellular coverage? Please. This is clearly grandfathered from the days of feature phones, where you couldn't tether or hotspot. So, you purchased a USB dongle to enable the hotspot. Even then, it was dirty, but it's clear that this is a byproduct of days gone by.

To enable tethering and the hotspot, you just need to add one line to the /system/build.prop config file. Unfortunately, /system/ is mounted read-only. So, you'll have to remount it as read-write and edit the file. However, every attempt I have made at modifying it has ended up with an empty file- IE: losing all its contents. So, rather than editing it manually, there is an app for that.

Install https://play.google.com/store/apps/details?id=com.jrummy.apps.build.prop.editor&hl=en. Add "net.tethering.noprovisioning" and set the property to "true", then reboot your phone. At that point, you should be able to USB tether and set up a wireless hotspot.
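For the curious, the edit itself is just a "key=value" line in a text file. Here is a minimal sketch of it, run against a local copy of the file (for example one fetched with "adb pull /system/build.prop") and not against the phone. The function name "set_prop" is made up for illustration, and I'm only assuming that the editor app does something along these lines:

```shell
# set_prop FILE KEY VALUE   (hypothetical helper; run it on a local copy only)
# Drop any existing definition of KEY from a build.prop-style file, then
# append KEY=VALUE. The new contents are built in a temporary file first, so
# the original is only overwritten once the result is known to be non-empty.
set_prop() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    # Keep every line that does not start with "KEY=" (literal match, no regex).
    awk -v k="$key" 'index($0, k "=") != 1' "$file" > "$tmp"
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    if [ -s "$tmp" ]; then
        # Write through the existing file so its permissions are preserved.
        cat "$tmp" > "$file" && rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Usage would be "set_prop build.prop net.tethering.noprovisioning true". Getting the result back onto the phone still means remounting /system/ read-write, which is exactly where I ran into trouble, so the app remains the path of least resistance.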

Conclusion

This wasn't for the faint of heart or for someone who doesn't care about gaining the necessary control over their Android phone that root would give them (setting up firewalls, ad blockers, tethering/hotspot, etc.). However, as mentioned earlier, it's getting fairly difficult to hard brick an Android phone these days. Even better, the steps are getting somewhat standardized. IE: flash factory images, flash custom recovery, install SuperSU, & optionally enable tethering/hotspot.

Your GnuPG Private Key

Aaron Toponce — Fri, 20 Nov 2015 02:36:08 +0000

This post is inspired by a discussion in irc://irc.freenode.net/#gnupg about Keybase and a blog post by Filippo Valsorda.

I was curious just exactly how my private key is encrypted. Turns out, gpg(1) can tell you directly:

$ gpg --output /tmp/secret-key.gpg --export-secret-keys 0x22EEE0488086060F
$ gpg --list-packets /tmp/secret-key.gpg
:secret key packet:
    version 4, algo 17, created 1095486266, expires 0
    skey[0]: [1024 bits]
    skey[1]: [160 bits]
    skey[2]: [1023 bits]
    skey[3]: [1023 bits]
    iter+salt S2K, algo: 3, SHA1 protection, hash: 2, salt: ad8d24911a490591
    protect count: 65536 (96)
    protect IV:  01 1e 07 58 4a b6 68 a0
    encrypted stuff follows
    keyid: 22EEE0488086060F
(...snip...)

Notice the line "iter+salt S2K, algo: 3, SHA1 protection, hash: 2, salt: ad8d24911a490591". In there, you see "algo: 3" and "hash: 2". What do those identifiers reference? If you refer to RFC4880, you can learn what they are:

Symmetric Encryption Algorithms

 0  Plaintext or unencrypted data
 1  IDEA
 2  3DES
 3  CAST5 (default)
 4  Blowfish
 5  Reserved
 6  Reserved
 7  AES-128
 8  AES-192
 9  AES-256
10  Twofish

Cryptographic Hashing Algorithms

 1  MD5
 2  SHA-1 (default)
 3  RIPEMD-160
 4  Reserved
 5  Reserved
 6  Reserved
 7  Reserved
 8  SHA-256
 9  SHA-384
10  SHA-512
11  SHA-224

I emphasized the defaults, which are CAST5 and SHA-1. So, your passphrase is salted and hashed with SHA-1, and the result is used as the CAST5 key that encrypts your private key. Thus, the whole security of your encrypted private key rests on the entropy of your passphrase, provided that sane defaults are chosen for the encryption and hashing algorithms, which they are.
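The lookup from the "algo:" and "hash:" numbers to names can be scripted. The following is a small sketch, assuming the S2K line has the same layout as in the gpg(1) output shown in this post; the function names "cipher_name", "hash_name", and "describe_s2k" are made up for illustration, and the ID tables are the RFC4880 ones:

```shell
# cipher_name ID: map an RFC 4880 symmetric algorithm ID to its name.
cipher_name() {
    case $1 in
        0) echo Plaintext ;; 1) echo IDEA ;; 2) echo 3DES ;;
        3) echo CAST5 ;; 4) echo Blowfish ;; 7) echo AES-128 ;;
        8) echo AES-192 ;; 9) echo AES-256 ;; 10) echo Twofish ;;
        *) echo "Reserved/unknown" ;;
    esac
}

# hash_name ID: map an RFC 4880 hash algorithm ID to its name.
hash_name() {
    case $1 in
        1) echo MD5 ;; 2) echo SHA-1 ;; 3) echo RIPEMD-160 ;;
        8) echo SHA-256 ;; 9) echo SHA-384 ;; 10) echo SHA-512 ;;
        11) echo SHA-224 ;;
        *) echo "Reserved/unknown" ;;
    esac
}

# describe_s2k: read `gpg --list-packets` output on stdin and print the
# cipher and hash named on each "S2K" line.
describe_s2k() {
    sed -n 's/.*S2K, algo: \([0-9]*\),.*hash: \([0-9]*\),.*/\1 \2/p' |
    while read -r algo hash; do
        echo "cipher=$(cipher_name "$algo") hash=$(hash_name "$hash")"
    done
}
```

Piping "gpg --list-packets /tmp/secret-key.gpg" into "describe_s2k" prints one line per protected key packet. For the line quoted above, that is "cipher=CAST5 hash=SHA-1".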

CAST5 has been well analyzed and it is not showing any practical or near practical weaknesses. It is a sane default to choose for a symmetric encryption algorithm. However, CAST5 uses 64-bit blocks for encrypting and decrypting data, which may have some theoretical weaknesses. AES uses 128-bit blocks, and thus has a larger security margin. Because AES-256 is available as a symmetric encryption algorithm, there really is no reason not to use it, even if the main benefit is feeling more secure.

SHA-1 is showing near practical attacks on blind collisions, but for use with keying a block cipher from a passphrase, it's still exceptionally secure. What is needed to break SHA-1 in this regard is a pre-image attack. A pre-image attack is where you have the hash, but you do not know the input that created it. This is not brute force. It means breaking the algorithm in such a way that, provided with any hash, you can reliably produce an input for it. SHA-1 has a wide security margin here, so there really is nothing practical to worry about. However, with SHA-512 available, there is also really no reason not to use a SHA-2 algorithm. In fact, aside from the increased security margin, SHA-512 is designed to work well on 64-bit platforms, but struggles on 32-bit ones. So this gives us an increased security margin, albeit negligible, over using something like SHA-256.

So, how can we change these? Turns out to be quite simple. All you need to do is specify the secret key symmetric encryption algorithm and hashing algorithm, then change your password (retype it with the same password if you don't want to change it):

$ gpg --s2k-cipher-algo AES256 --s2k-digest-algo SHA512 --edit-key 0x22EEE0488086060F
Secret key is available.

pub  1024D/0x22EEE0488086060F  created: 2004-09-18  expires: never  usage: SCA
                               trust: unknown  validity: unknown
sub  1792g/0x7345917EE7D41E4B  created: 2004-09-18  expires: never  usage: E
sub  2048R/0xCE7911B7FC04088F  created: 2005-07-04  expires: never  usage: S
(...snip...)
gpg> passwd
(...snip...)
gpg> save

Now if we export our key, and look at the OpenPGP packets, we should see the new updates:

$ gpg --output /tmp/secret-key.gpg --export-secret-keys 0x22EEE0488086060F
$ gpg --list-packets /tmp/secret-key.gpg
:secret key packet:
    version 4, algo 17, created 1095486266, expires 0
    skey[0]: [1024 bits]
    skey[1]: [160 bits]
    skey[2]: [1023 bits]
    skey[3]: [1023 bits]
    iter+salt S2K, algo: 9, SHA1 protection, hash: 10, salt: 9c3dbf2880791f2e
    protect count: 65536 (96)
    protect IV:  db f5 e8 1c 98 03 99 7c 77 33 4e cd d3 3c 1f 4f
    encrypted stuff follows
    keyid: 22EEE0488086060F
(...snip...)

Now I have "algo: 9" which is "AES256" and "hash: 10" which is "SHA512" protecting my private key. I've gained a little bit extra security margin at the cost of retyping my passphrase. Not bad. Now, I'd like to pose you a question:

If I were to publish my encrypted GnuPG private key, think I'm crazy?

Let's look at this for a second. We know that the key is protected with AES-256. AES remains secure, after 15 years of intense scrutiny and analysis, and is showing no signs of wear. It's the most used symmetric encryption algorithm that protects HTTPS, SSH, VPN, OTR, Tor, and even your GnuPG encrypted emails. It protects social security numbers, credit card transactions, usernames and passwords, patient health records, hard drives, and on, and on, and on. AES is secure.

So, breaking the encrypted key will be at least as hard as breaking AES, which seems to be a long shot. So, you would be better off attacking my passphrase. Knowing that it was used as the key for AES, and that it is hashed with SHA-512, we can write a brute force algorithm to generate passphrases, hash them with SHA-512, and attempt to decrypt the AES-encrypted key. After all, we have the salt and the IV right there in plaintext in the packets. Turns out, there are already projects for this.

This should be a breeze. Unless, of course, you have sufficient entropy behind your passphrase. What is sufficient entropy? Well, take some time searching "entropy" on my blog. You'll quickly learn that I recommend at least 80 bits. 100 bits would give you a paranoid security margin, even with the password being hashed with a fast hashing algorithm like SHA-512.
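To put those numbers in perspective, here is a back-of-the-envelope sketch of how long an exhaustive search takes. The guess rate is purely an assumption picked for illustration (one trillion passphrases per second), not a measurement of any real cracking rig, and the function name "crack_years" is made up:

```shell
# crack_years BITS RATE   (hypothetical helper)
# Print the years needed to try every one of 2^BITS passphrases at RATE
# guesses per second. This is the worst case; on average an attacker
# finds the passphrase after searching half the space.
crack_years() {
    awk -v bits="$1" -v rate="$2" 'BEGIN {
        seconds = 2 ^ bits / rate
        printf "%.0f\n", seconds / (365.25 * 24 * 60 * 60)
    }'
}
```

At that assumed rate, "crack_years 80 1000000000000" comes out to roughly 38,000 years, and every additional bit of entropy doubles it.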

So, if the private key is encrypted with AES-256, and the passphrase has sufficient entropy to withstand a sophisticated attack, and it is hashed with SHA-512, then I should have no concern posting my encrypted key into the open, right?

Without further ado, here is my encrypted GnuPG private key:

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 - -----BEGIN PGP PRIVATE KEY BLOCK----- Version: GnuPG v1 lQHpBEFLyzoRBACXCUta5CK+DCgnXn9wkqUumkcbenibGPBe3Y8IEY4BjkdbGdTN tiGB+Tvo0hzn2qzy4mNPlOx/LWZWF2MdwF3WS77wwIskMb8W314zhE2RS0G318YY X7zMGSF+7QiNXNsW/d0t1RonYOKIS96zKOtFQZrTr//V8+1rxEa4rvO5dwCgul0s pt2BUDqwoy2Q/5UKgnmrzmsD/37/3g5zXykvTH2P6BlgTdfnVvpOLDT3CyWlAynz u5hdmgYNT50I2w5TstY+uViYhAbMiyIT1HwBRcaQh8hUWkzDGyzJF7pS4pZeD0M9 u0P7Cejm2+ENdOX66ablWjP7GLJRcToGxnAZ6hgPpWLen8lHYaUK//g4JJx8UJ/n wifeA/9xYWDi3ur/fFCKQZIPV9Ziw1oL58su948yWRn2WN7m74+bSldkXzkc4jRe Q51FpGBHMswRIJKB6yG1FbfLum8ppGbvtz9NrMMZuirguTWetX8aJrjr0ddGjTsY uZPfKoUiqDUXSFc3hmVgQQQ4MFdD3XYy6AQTyI1vstCS/Tdn7P4JAwqcPb8ogHkf LmDb9egcmAOZfHczTs3TPB9Pg1SJqjvSz7nKDY87EVmeM46YBaCs1XScaOF4Gs+x u0LNAxlfX3xOUIWRtCdBYXJvbiBUb3BvbmNlIDxhYXJvbi50b3BvbmNlQGdtYWls LmNvbT6IWQQTEQIAGQUCQUvLOgQLBwMCAxUCAwMWAgECHgECF4AACgkQIu7gSICG Bg/+cACeM0EeO7gE85/OSwMzxjvQAGB53jgAnik6qvFWyQtvp71KElbpZUsa0YNj nQIoBEFLy0IQBwCjVGmY/PmOtRHtBIuANfg9zf8thGXZtFZWEgzHLGUgSfIjb0di F24mwiVw2k3gzqBKuFBJ633F3AhwlTBnXS3tLWQgwSrm3BcCOOn+wJvwgUXa0iBn gXhcq/7IL7HnKsYiG9EFMI8mAd10t7zdsA/dS2xUmFNexQvUdra/uRU+eeQbrXYe iwfymw2RCZROl4/QXA9/a/aTyUhKgkj2vieo0jh394h7grPZxw0lgCclvTN/0jGq dPkp56NMDb0eGlVzWeEiseD1JxXeYaSeToJP3zmx+nFoiBa+VVVeUhzYAwADBQb+ P373jEOwDu63py4FWdMPMYNWv1xFWLYI5hWaTKPSRwG/NZICRDF+QNztSmVOhW7Z SFY/nTHq5yFp+QID63VMGv5Cunse9QXAoarecyV2hllwUq7l7wHujJhvvqyEgL9G ah/drkZMGe8btYihz/M4g5i1P2DPr4CL/46eZxgjmjuVw7Nb0UsgUPgGizPCbnJ3 ye1ahxc1dOX80Guh0ZDRfR/ehZkk07wN2H6KRrxFCAmDaCR9KxwGYbpepND0t4HG qCMji37lzYUrS4PV6yK0DGekqF98xgiIVBYFjF0jwQD+CQMKnD2/KIB5Hy5g4K9F 3JFIPxw+L+Gdc2MSYuJI3Y5kpZluUmYYYYFgOH64/J8egeYSSKWUIPqMhnwBWtbW GQCNdztQyIIi6mB3YiDeK2AMnhRq+PwwwGG1iEYEGBECAAYFAkFLy0IACgkQIu7g SICGBg+MhQCfZ7FNu4wMtdifkblGkN5Qqj+cYWYAoLiipdgnnhPTP2z7SgOsxiR4 YI4wnQPEBELI4BIBCADCobEk1f0sByuV4p2moEmZIXXEJhzTolO2mmBLBSmbjPMg OBpAFTmWYXJxo8oZnNeeTOWN/hsHV9rmC/c/bZ7FGX4u/pN0l3qOSoSO6m6yvM/O q4idyl17SfqDR9AUdEwJIA9zfSomzmzfO/q3rIAOPNkd71RQ4YMkHn8SfQLaojxl 
+N5pPB29c6Cy+hltP6JRibAHgYCkxQGLK6ZLQ1LTwkDLNxCH8qfmHvg7M3qcXxl4 KOgtb8kNZ3PpAvek1GNL/eaF7nnd/u5uHuJM5V7czVHjpMbBfRGgLzR6Tu6v8qm7 nJzhSErxC7lK0hXLgGuHlk26bj9pzb+lfBk7GnkRAAYp/gkDCpw9vyiAeR8uYPUV mINtxpejZeID6EaNZ8wTKNB1e3yt/As2Svkpcl1ivosbADhxCTFSJ4RtmEynFPCw qgYbjuTBK+Fv2LSd7CYHJm4zZ/hfiCBgL4jI3ZY52qHFnBhgHtLwza/eLeaC0i4+ RWUAN9JEW1Ygpsh0ApRd1UvY8dGoLF24fy3C2jeh5Q3SVw2SnhUryLU7u6gTTLt4 yskT+/pzdNrb2n3dvLLUhQJ50SeTroa/swB/SXnjWtDSRGpvv3HAStHSENaZCIrV e/u/WSNd99nklpYdPwh3UFO9OOayYErYiprEhLC7JjdVbqbu2aIfrw+nS/PS+tlm wQTYH/DKA6wIeSZK4y9Byfrx3oq4DPk9pge7n3z3oz9pzd+xG2GMvExwKiCUlcF5 he5Peb4sTsJz0epMpwcBXFNYyVO++rZvBTABhgY5MQAcympoBAAcspo8AvBWx2xK SaIyp5XDtT8jqey5Jjo3qmIQvdpo5oQub73JCMDSHi7KZ/IkgtggpgDqNh9bT5ZE C7d3lprPFQJNQtFTO2K8NecRkgatn9Imomy1DidlnsCuZjqfamNVssWhm0uNu8SC aJYBtC9kZXhBLAy0mFCMfJQ8ysDzlG3+Fu9wE9tCQvkh7y0ZKo9Ortqnwqp6S0I3 2xMBOg2EMHRP+ADF3q/i2wlgbH0MHsiX4oMxd0eIZV+rJIQSnO547UvvzAUBjXjZ 1l23wmg8FctYIE0UPsTo1QNArGO1sIOdv807UMolB9lq1pKAXkJP0GPRZLv9bLDj kkKLEBELXCCQA0vDH+N/eGXRdJbWv5i0mOXmxK2JAZZw8rsJF3XZ22PABrC0NaCP qHftcq20PPMCRZ/TfjQmwlot495KmLUJo2G6JasdcEilyPH2GVUwiqCM2/wOWk4U xX+FmA40NYmU64hJBBgRAgAJBQJCyOASAhsCAAoJECLu4EiAhgYPt9kAn3r0Hhf9 aTFyowH3pgqIUiaMVQo5AKCFWzcU23YT0E2/LZNl6Yqzcs113Q== =SHc5 - -----END PGP PRIVATE KEY BLOCK----- -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAEBCgAGBQJWToHpAAoJEM55Ebf8BAiPk7kH+weVf8kVJRjSaSWE+Aft76iA Nzj1cVUfpWoT/K139i3TMiZ6PpAQtCRyEakdxfeSfXiOz83pqmKSL5ADCdlRoxuB HtkoLW6thETOs70mDrrsEQgBZgMYPMsiKG1W/M3xppRGZxUM7/UEXhjHYiThe1Qd Dkwot+hu5EttQpu0kKFmPrviPpJOk0gJ5SQrhlROWCS+aT9TyhbswMRpSyurDZ2H LppGk8EtBeWTsTf9AhemX1GFu4iJPwIDfZtiWOLGGjQn4ROqb/RWqLG254O//Gw6 jtDRGHIGyYk+2NQ6/gAKWI9Sxaz5kUxqKSzDU9WuDCE3peIB9HXM+ynFVKLgsXE= =Rnfg -----END PGP SIGNATURE-----

Now that this is posted, I should expect everyone to steal my GnuPG identity, parading around as me, forging signatures, correct? No, it's not happening. I trust the crypto. I trust the math. I trust the software.

Why? Why am I doing this? Well, a recent discussion popped up on IRC about Keybase. Some don't like the fact that they are encouraging you to upload your encrypted private key to the server. Some claim that it is "insecure". However, didn't we just logically and reasonably conclude that AES and SHA-2 are protecting my best interests? So, if it's insecure, then it can't be due to the crypto. It must be something else. Let's look at all the risk factors:

Keybase servers are compromised and you use the web interface
If the server is compromised, then the attackers can modify the code, providing malicious JavaScript to the browser, so when you successfully decrypt your private key, it can be sent and stored elsewhere under their control. There is nothing you can do here. You are screwed. This is a very valid concern.

Keybase servers are compromised and you use the command line interface
The attackers only have access to your encrypted private key. Provided you never use the web interface, all encryption and decryption is handled out-of-band. This means that regardless of the Keybase server compromise, the attackers will never actually get to forge signatures using your GnuPG private key.

Your local client is compromised
There is nothing I can do for you here. Keybase isn't to blame, and not even the gpg(1) client can protect you. You have bigger problems on your hands than the attackers gaining access to your unencrypted GnuPG private key.

I think a lot of GnuPG users are paranoid about their key getting leaked, because they are unsure of exactly how it is stored. Hopefully this post lays those fears to rest. If there are still concerns about the leak of encrypted private keys, then it's probably due to not fully understanding the strength of passwords and their entropy requirements. However, if you meet the necessary password entropy requirements, and your key is encrypted with CAST5/SHA-1 or AES-256/SHA-512, there is nothing wrong with keeping a backup of your GnuPG key on GitHub, AWS, Dropbox, or other "cloud" hosting solutions.

Trust the math. Trust the software.

By the way, if you do successfully recover my private key passphrase (and I know you won't), I would be interested in hearing from you. Send me a signed email with my private key, and I'll make it financially worth your while.

Now Using miniLock

Aaron Toponce — Wed, 11 Nov 2015 02:45:35 +0000

I have long been a proponent of OpenPGP keys as a way to communicate securely. I have used my personal key for signing emails since ~ 2005. I have used my key at dozens and dozens of keysigning parties. I have used my key to store account passwords and credentials with vim(1), Python, and so many other tools. I have used my key to encrypt files to myself. And so much more. There is only one problem.

No one is using my key to send me encrypted data.

Sure, when attending keysigning parties, it's an encryption orgy. Attendees will sign keys, then send encrypted copies to recipients. And, of course, people will send encrypted emails before and after the party, for various reasons. But, when the party dies down, and people get back to their regular lives, very few actually send encrypted data with OpenPGP keys. Realistically, it's rare. Let's be honest. Sure, there's the one-off, either as a corporation (XMission uses OpenPGP extensively internally for password storage) or individuals (sending encrypted tax forms to a spouse or accountant), but by and large, it's rarely used.

I have a good idea of why, and it's nothing ground breaking- OpenPGP is hard. It's hard to create keys. It's hard to manage the keys. It's hard to grasp the necessary concepts of public keys, private keys, encryption, decryption, signatures, verification, the Web of Trust, user identities, key signing parties, revocation certificates, and so much more.

OpenPGP is just hard. Very hard.

Well, in an effort to encourage more people, such as my family and friends that would not use OpenPGP, to encrypt sensitive data, I've jumped on board with miniLock. What is miniLock? Currently, it's a Free Software browser application for the Google Chrome/Chromium browser (not an extension). It uses ECC, scrypt, BLAKE2, zxcvbn, and a number of other tools that you really don't need to worry about, unless you want to audit the project. All you need is an email and a password. The keys are deterministically generated based on that.

Think about this for a second. You don't need a public and private keyring to store your keys. You don't need to upload them to a key server. You don't need to attend keysigning parties, worry about the Web of Trust, or any of that other stuff that makes OpenPGP the nightmare it is.

All you need is an email and a password.

Unfortunately, this does have one big drawback- your email or password can't change, without changing your keys. However, the miniLock keys are cheap- IE: you can change them any time, or create as many as you want. You only need to distribute your miniLock ID. In fact, the miniLock ID is the entire public key. So, they don't even need to be long term. Generate a one-time session miniLock ID for some file that you need to send to your accountant during tax season, and call it good.

However, I prefer long-term keys, so as such, I created 3 IDs, one for each email account that I use. If you want to send me encrypted data, without the hassle of OpenPGP, feel free to use the correct miniLock ID for the paired email address.

Email                     miniLock ID
aaron.toponce@gmail.com   mWdv6o7TxCEFq1uN6Q6xiWiBwMc7wzyzCfMa6tVoEPJ5S
atoponce@xmission.com     qU7DJqG7UzEWYT316wGQHTo2abUZQk6PG8B6fMwZVC9MN
aaron.toponce@utah.edu    22vDEVchYhUbGY9Wi6EdhsS47EUeLKQAVEVat56HK8Riry

Don't misunderstand me. If you have an OpenPGP key, and would prefer to use that instead, by all means do so. However, if you don't want to set up OpenPGP, and deal with the necessary overhead, I can now decrypt data with miniLock. Maybe that will be a better alternative for you instead.

Do XKCD Passwords Work?

Aaron Toponce — Tue, 15 Sep 2015 12:22:33 +0000

You'll always see comments on web forums, social sites, blog posts, and emails about "XKCD passwords". This is of course referring to the XKCD comic by Randall Munroe describing what he thinks is the best password generator:

What no one has bothered asking, is if this actually works.

Lorrie Faith Cranor, director of the Carnegie Mellon Usable Privacy and Security Laboratory at Carnegie Mellon University, a member of the Electronic Frontier Foundation Board of Directors, and Professor in the School of Computer Science and the Engineering and Public Policy Department at Carnegie Mellon University, did ask this question. In fact, she studied it to the point that she gave a TED talk on the subject. The transcript of her talk can be found here. Here are the relevant bits (emphasis mine):

Now another approach to better passwords, perhaps, is to use pass phrases instead of passwords. So this was an xkcd cartoon from a couple of years ago, and the cartoonist suggests that we should all use pass phrases, and if you look at the second row of this cartoon, you can see the cartoonist is suggesting that the pass phrase "correct horse battery staple" would be a very strong pass phrase and something really easy to remember. He says, in fact, you've already remembered it. And so we decided to do a research study to find out whether this was true or not. In fact, everybody who I talk to, who I mention I'm doing password research, they point out this cartoon. "Oh, have you seen it? That xkcd. Correct horse battery staple." So we did the research study to see what would actually happen.

So in our study, we used Mechanical Turk again, and we had the computer pick the random words in the pass phrase. Now the reason we did this is that humans are not very good at picking random words. If we asked a human to do it, they would pick things that were not very random. So we tried a few different conditions. In one condition, the computer picked from a dictionary of the very common words in the English language, and so you'd get pass phrases like "try there three come." And we looked at that, and we said, "Well, that doesn't really seem very memorable." So then we tried picking words that came from specific parts of speech, so how about noun-verb-adjective-noun. That comes up with something that's sort of sentence-like. So you can get a pass phrase like "plan builds sure power" or "end determines red drug." And these seemed a little bit more memorable, and maybe people would like those a little bit better. We wanted to compare them with passwords, and so we had the computer pick random passwords, and these were nice and short, but as you can see, they don't really look very memorable. And then we decided to try something called a pronounceable password. So here the computer picks random syllables and puts them together so you have something sort of pronounceable, like "tufritvi" and "vadasabi." That one kind of rolls off your tongue. So these were random passwords that were generated by our computer.

So what we found in this study was that, surprisingly, pass phrases were not actually all that good. People were not really better at remembering the pass phrases than these random passwords, and because the pass phrases are longer, they took longer to type and people made more errors while typing them in. So it's not really a clear win for pass phrases. Sorry, all of you xkcd fans. On the other hand, we did find that pronounceable passwords worked surprisingly well, and so we actually are doing some more research to see if we can make that approach work even better. So one of the problems with some of the studies that we've done is that because they're all done using Mechanical Turk, these are not people's real passwords. They're the passwords that they created or the computer created for them for our study. And we wanted to know whether people would actually behave the same way with their real passwords.

So, in her research, XKCD passwords really didn't work out that well. They are longer in length, so they take longer to type, which increases the chance for error, and people are no better at remembering an XKCD passphrase than they are at remembering a short string of random characters.

To me, this is unsurprising. If you look at the history of my blogging on passwords, you'll find that I continually advocate true random events to build your passwords, maximizing entropy. In my last post, I even blogged two shell functions that you can use to build XKCD passwords, and "monkey passwords" (monkeys generating passwords by banging away at a keyboard). Both target 80-bits of entropy in the generation. Check out the lengths:

$ gen-monkey-pass 9
cxqwtw63taxdr3zn
uaq4tbt43japmm2q
mptwrxhhb486yfuv
-cb73b9-kgzhmww3
s45t3x6r9smw-7yr
hjkgzkha-qup4gh4
34c5rg4ksw-aprvk
uug-2vq7pfze6dnp
s4qx4eazbnrd2pqe
$ gen-xkcd-pass 9
sorestdanklyAlbanyluckyRamonaFowler (sorest dankly Albany lucky Ramona Fowler)
towsscareslaudedrobinawardsrenal (tows scares lauded robin awards renal)
thinkhazelsvealjuggedagingscareen (think hazels veal jugged agings careen)
tarotpapawsNolanpacketAvonwiped (tarot papaws Nolan packet Avon wiped)
surgesakimbohardercruelArjunablinds (surges akimbo harder cruel Arjuna blinds)
amountlopsedgemeaslyCannoninseam (amount lops edge measly Cannon inseam)
EssexIzmirwizesPattygroutszodiac (Essex Izmir wizes Patty grouts zodiac)
hoursmailedslamsvowedallowspar (hours mailed slams vowed allow spar)
AfghanNigelnutriadillmoldertrolly (Afghan Nigel nutria dill molder trolly)

XKCD passwords average 32 characters to achieve 80-bits of entropy, compared to 16 characters that "monkey passwords" produce. And, according to the research done by Lorrie, people won't necessarily recall XKCD passwords any easier than "monkey passwords". So, if that's the case, then what's the point? Why bother? Why not just create "monkey passwords", and use a password manager?
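The length difference falls straight out of the arithmetic: the entropy of a randomly generated password is the number of symbols times log2 of the pool each symbol is drawn from. A minimal sketch follows. The pool sizes used in the note below it are inferred from the output above, not taken from the generator source, and the function name "entropy_bits" is made up for illustration:

```shell
# entropy_bits COUNT POOL   (hypothetical helper)
# Print the entropy, in bits, of COUNT symbols each drawn uniformly at random
# from a pool of POOL equally likely choices: COUNT * log2(POOL).
entropy_bits() {
    awk -v n="$1" -v pool="$2" 'BEGIN { printf "%.1f\n", n * log(pool) / log(2) }'
}
```

Sixteen characters at 80 bits works out to 5 bits per character, which implies a 32-character alphabet: "entropy_bits 16 32". Six words at 80 bits needs a word list of a little over 10,000 entries, and at roughly five or six letters per word you end up at about double the length.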

Exactly. It's 2015. There are password managers for your browser, all versions of every desktop operating system, command-line based utilities for servers, and even apps for your smartphone. There are plenty of "cloud" synchronization services to make sure each instance is up-to-date. At this point, your passwords should:

Contain at least 80-bits of entropy.

Be truly randomly generated (no influence from you).

Be unique for each and every account.

Be protected with two-factor authentication, where available.

Be stored in a password manager, that is easily accessible.

You'll remember the ones you type in frequently, and you'll memorize them quickly. The others are stored for safe keeping, should you need to recall them.

Password Generation in the Shell

Aaron Toponce — Sat, 05 Sep 2015 13:44:49 +0000

No doubt, some people use password generators; not many, but some. Unfortunately, this means relying on third-party utilities, where the source code may not always be available. Personally, I would rather be in full control of the entire generation stack. I know how to make sure plenty of entropy is available in the generation, and I know which sources of entropy to draw on to maximize the entropy estimate. As such, I don't use tools like pwgen(1), apg(1), or anything else. I rely strictly on /dev/urandom, grep(1), and other tools guaranteed to be on every BSD and GNU/Linux operating system.

As such, the script below has been successfully tested in various shells on Debian GNU/Linux, PC-BSD, FreeBSD, OpenBSD, NetBSD, and SmartOS. If you encounter a shell or operating system this script does not work in, please let me know. Thanks to all those who helped me test it and offered suggestions for improvement.

So, with that said, here they are:

# No copyright. Released under the public domain.
# You really should have shuf(1) or shuffle(1) installed. Crazy fast.
shuff(){
    if [ $(command -v shuf) ]; then
        shuf -n "$1"
    elif [ $(command -v shuffle) ]; then
        shuffle -f /dev/stdin -p "$1"
    else
        awk 'BEGIN{
            "od -tu4 -N4 -A n /dev/urandom" | getline
            srand(0+$0)
        }
        {print rand()"\t"$0}' | sort -n | cut -f 2 | head -n "$1"
    fi
}
gen_monkey_pass(){
    I=0
    [ $(printf "$1" | grep -E '[0-9]+') ] && NUM="$1" || NUM="1"
    until [ "$I" -eq "$NUM" ]; do
        I=$((I+1))
        LC_CTYPE=C strings /dev/urandom | \
            grep -o '[a-hjkmnp-z2-9-]' | head -n 16 | paste -s -d \\0 /dev/stdin
    done | column
}
gen_xkcd_pass(){
    I=0
    [ $(printf "$1" | grep -E '[0-9]+') ] && NUM="$1" || NUM="1"
    [ $(uname) = "SunOS" ] && FILE="/usr/dict/words" || FILE="/usr/share/dict/words"
    DICT=$(LC_CTYPE=C grep -E '^[a-zA-Z]{3,6}$' "$FILE")
    until [ "$I" -eq "$NUM" ]; do
        I=$((I+1))
        WORDS=$(printf "$DICT" | shuff 6 | paste -s -d ' ' /dev/stdin)
        XKCD=$(printf "$WORDS" | sed 's/ //g')
        printf "$XKCD ($WORDS)" | awk '{x=$1;$1="";printf "%-36s %s\n", x, $0}'
    done | column
}

Nothing fancy about them. The first function, "shuff", is really just a helper function for systems that might not have shuf(1) or shuffle(1) installed. It's used only in the "gen_xkcd_pass" function. The next function, "gen_monkey_pass", acts like monkeys banging on a typewriter. It reads /dev/urandom directly, keeps the printable characters that fall in its character set, collects 16 of them, and arranges the results in orderly columns, as seen below. The character set holds exactly 32 characters, giving each character exactly 5 bits of entropy. So, at 16 characters, each password comes with exactly 80 bits of entropy. The set was chosen to stay entirely lowercase letters and digits, plus the hyphen, and to remain unambiguous (it omits "i", "l", "o", "0", and "1"), so it's clear and easy to type, even though it may still be hard to remember. The function can take a numerical argument, for generating exactly that many passwords:

$ gen_monkey_pass 24
awdq2zwwfcdgzqpm  t54zqxus77zsu6j6  -2h6dkp93bjdb496
thm9m9nusqxuewny  qmsv2vqw-4-q4b4d  ttbhpnh4n7nue5g8
ytt6asky765avkpr  grwhsfmyz872zwk3  mzq-5ytdv8zawhy6
zb46qgnt62k74xwf  uydrsh2axaz5-ymx  6knh32qj4yk885ea
vky55q2ubgaucdnh  5dhk9t97pfja9phj  rhn2qg734p83wnxs
-q2hb833c-54z-9j  t33shcc55e3kqcd6  q6fwn3396h4ygvq4
232hr73rkymerpyg  u2pq-3ytcpc79nb9  7hqqwqujz4mxa-en
jj9vdj3jtpjhwcp6  mqc97ktz-78tb2bp  q7-6jug86kqhjfxn
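
The 32-character, 5-bits-per-character arithmetic can be checked from the shell. The following is only a sketch: the helper names "charset_size" and "entropy_bits" are made up for this illustration and are not part of the script above. It assumes a grep(1) that supports -c and an awk(1) whose printf accepts %c with a numeric argument.

```shell
# Count the printable ASCII characters matched by a bracket expression.
# Each candidate character is printed on its own line, so grep -c
# (count of matching lines) equals the size of the character set.
charset_size() {
    awk 'BEGIN { for (i = 33; i <= 126; i++) printf "%c\n", i }' |
        LC_ALL=C grep -c -e "$1"
}

# Bits of entropy for LEN symbols drawn uniformly from SIZE choices:
# LEN * log2(SIZE).
entropy_bits() {
    awk -v size="$1" -v len="$2" \
        'BEGIN { printf "%.1f\n", len * log(size) / log(2) }'
}

size=$(charset_size '[a-hjkmnp-z2-9-]')
echo "$size characters, $(entropy_bits "$size" 16) bits at length 16"
```

With the bracket expression used by gen_monkey_pass, this should report 32 characters and 80.0 bits.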

The last function, "gen_xkcd_pass", comes from the "correct horse battery staple" comic from XKCD. On every Unix system, there is a dictionary file installed at /usr/share/dict/words or /usr/dict/words. On Debian GNU/Linux, it contains 99,171 words (OpenBSD contains 234,979!). However, many of them have the apostrophe as a valid character. Taking out any punctuation and digits, we are left with just lowercase and uppercase characters for our words. Further, the total word space is limited to words at least 3 characters and at most 6 characters in length. This leaves us with 19,198 words, or about 14.229 bits of entropy per word. This means generating at least 6 words to achieve an 80-bit entropy minimum. For clarity, the password is repeated, space-separated, in parentheses to the right, to show exactly which words it contains, as shown below. Even if all 6 words have 6 characters (the password is 36 characters in total), the formatted line will never be longer than 80 characters in width, making it fit perfectly in an 80x24 terminal. It also takes a numerical argument, for generating exactly that many passwords:

$ gen_xkcd_pass 8
flyersepticspantearruinedwoo (flyer septic span tear ruined woo)
boasgiltCurrywaivegalsAndean (boas gilt Curry waive gals Andean)
selectpugjoggedlargeArabicbrood (select pug jogged large Arabic brood)
titshubbubAswancartharmedtaxi (tits hubbub Aswan cart harmed taxi)
Reaganmodestslowleessamefoster (Reagan modest slow lees same foster)
tussleFresnoJensentheirsNohhollow (tussle Fresno Jensen theirs Noh hollow)
Laredoriffplunkbarredhikersrearm (Laredo riff plunk barred hikers rearm)
demostiffnukesvarlethakegilt (demo stiff nukes varlet hake gilt)

Of course, as you can see, some fairly obscure words pop out as a result, such as "gilt" and "rearm". But then, you could think of it as expanding your vocabulary. If you install the "american-insane" dictionary, then you can get about 650,722 words in your total set, bringing your per-word entropy north of 16 bits. This would allow you to cut your number of generated words down to 5 instead of 6, to keep the 80-bit entropy minimum. But then, you also see far more obscure words than with the standard dictionary, and it will take a touch longer to randomize the file.
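
The per-word entropy and the number of words needed can be recomputed for any dictionary. This is a rough sketch, not part of the script above: "candidate_count" and "words_needed" are hypothetical helper names, and the filter simply repeats the 3-to-6-letter expression used in gen_xkcd_pass.

```shell
# Count dictionary entries made only of 3 to 6 ASCII letters,
# the same filter gen_xkcd_pass applies.
candidate_count() {
    LC_ALL=C grep -E -c '^[a-zA-Z]{3,6}$' "$1"
}

# Print bits of entropy per word, then the number of words needed to
# reach TARGET bits (default 80), for a word list of COUNT entries.
words_needed() {
    awk -v count="$1" -v target="${2:-80}" 'BEGIN {
        if (count < 2) exit 1            # no entropy in a 0- or 1-word list
        bits = log(count) / log(2)       # log2(count)
        need = int(target / bits)
        if (need * bits < target) need++ # round up to whole words
        printf "%.3f %d\n", bits, need
    }'
}

words_needed 19198
```

For the 19,198-word set this prints 14.229 and 6, matching the figures above. To check a different dictionary, something like words_needed "$(candidate_count /usr/share/dict/words)" should work, with the path adjusted for your system.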

This script should be platform agnostic. If not, let me know what isn't exactly working in your shell or operating system, and why, and I'll try to address it.

Setting Up A Global VPN Proxy on Android with L2TP/IPSec PSK

Aaron Toponce — Fri, 04 Sep 2015 12:00:26 +0000

In my last post in this short series, I want to discuss how to set up a transparent proxy on your Android phone using the built-in VPN for L2TP. As usual, the same precautions apply here. Don't be stupid with your data, just because you can hide it from your ISP.

In general, I'm skeptical of VPN service providers, which is partially why I'm writing this post. There isn't a VPN provider on this planet that will go to jail for you. And I don't buy into the hype that they aren't logging your traffic. Too often, VPN providers have been all too hasty to turn over user account information and logs, when Big Brother comes knocking. Instead, install strongSwan on your own L2TP VPN server, in a datacenter you trust to handle your traffic, and configure your Android to use that.

Unlike the previous posts, this one does not require root access. To start, you need to navigate to "Settings -> More -> VPN":

Tap the "+" sign to add a new VPN configuration. In this example, we'll configure it to connect to an L2TP/IPSec PSK VPN. As such, you'll need to fill out the server address (pixelated here), and the IPSec pre-shared key. Give the configuration a name, such as "My VPN", and tap "SAVE".



When tapping on the "My VPN" defined configuration, you will be asked to authenticate with your credentials. These can be from the operating system accounting database, LDAP, NIS, or IPSec specific. Provide your username and password, and tap "Save account information" if you want to save the credentials to disk on the phone. Then tap "CONNECT". At this point, you should see a little key in the status bar, confirming that you are indeed connected to the VPN server. If you want, you can create a "VPN" quick-access widget on your home screen, so you can get immediate access to your "My VPN" configuration with a single tap.



Setting Up A Global Tor Proxy on Android with Orbot

Aaron Toponce — Thu, 27 Aug 2015 12:00:25 +0000

In my last post, I explained how to set up a Global SSH proxy on Android with ConnectBot and ProxyDroid. In this article, I'll do the same thing, but with Orbot. Also, as with the last article, the same precautions apply here. If you're on an untrusted or unknown network, using an encrypted proxy can be helpful. However, just because you're using Tor doesn't mean you should trust its network blindly either. There are all sorts of practical attacks on Tor that have been reaching the press lately, and you would be wise to read about them, and proceed with caution.

With that said, sometimes all you want to do is get around a content filter, such as viewing Reddit at church, or getting on Twitter while at work. Of course, there are necessary risks with those actions as well. Basically, don't be an idiot.

With that out of the way, this requires that you have root access on your phone, and that you have installed the Orbot Android app. Once the app is installed, we really only need to make one adjustment, and that is enabling two check boxes: "Transparent Proxying" and "Tor Everything":

As something you should keep in mind, you may also want to check "Use Bridges". Relay bridges are entry nodes that are not listed in the main Tor directory. As such, it is more difficult for ISPs to filter them. If you suspect that your ISP is blocking all known entry nodes, then using bridges can be helpful to get around the problem. But, using bridges may be unnecessary. Check if your Tor connection is getting filtered first. If so, enable the use of bridges, otherwise, you're just fine using Tor without them.

Also, Orbot has some interesting settings, such as specifically setting a whitelist of entry and exit nodes, and a blacklist of nodes to avoid. If you know someone is operating a Tor node, and you trust them, then I would recommend setting them as either an entry or exit, whichever is appropriate. The reason is that it is not impractical for a well-funded organization to operate a large number of entry and exit nodes. If so, they can build traffic profiles on who is connecting to the entry node, and which site they are visiting from the exit. However, by specifying specific nodes for either entry or exit (or both), you eliminate this threat. Sadly enough, I could not get this working with Orbot.

One last setting that has caught my eye, is "Tor Tethering". If you use your phone as a wireless hotspot, or USB tethering, you can also transparently route all the traffic from those connected clients through the Tor proxy. I haven't tested this yet with the latest version, but with previous versions of Orbot, it didn't work.

Other settings are listed below, page after page.



When at the main page of the app, long-tap the power button in the center of the droid to connect to the Tor network. When the arms of the droid are down, you are not connected. When the arms are yellow, and pointing to the sides of the phone, the app is trying to get a connection to the Tor network. When the arms are green, pointing up, you are fully connected, and can start enjoying your proxy.



Notice that when you are connected, an onion icon is in the status bar at the top of the phone, showing as a permanent notification. If you have "Expanded Notifications" set, you can get IP address and country information in the notification. If you swipe the droid right or left, the droid will spin, and you will end up with a new "Tor Identity". Basically, you'll be connected to a new set of nodes.



Tapping the "CHECK BROWSER" button at the bottom left of the landing screen will use your default browser app to connect to https://check.torproject.org and verify whether or not transparent proxying over Tor is working.

Setting Up A Global SSH Proxy on Android with ConnectBot and ProxyDroid

Aaron Toponce — Wed, 26 Aug 2015 12:00:31 +0000

I'm one who takes precautions with my data when on unfamiliar or untrusted networks. While for the most part, I trust TLS to handle my data securely, I find that it doesn't take much effort to set up a transparent proxy on my Android handset, to route all packets through an encrypted proxy.

In this case, I happen to work for the greatest ISP in the world, and so I have an SSH server in the datacenter. I wholly trust the network from my SSH server to the border routers, so the more traffic I can send that direction, the better. I realize that may not be the case for all of you. However, if you have an externally available SSH server on a trusted network, this post may be of interest.

First, setting up this proxy requires having root. I'm not going to cover how to get root in this post. You can find it elsewhere. Next, you'll need two apps installed; namely ConnectBot and ProxyDroid. Both are Free Software apps. Also, you can do this with SSH Tunnel on its own, if you have Android 4.2.2 or older. Unfortunately, it doesn't work for 4.3 and newer. I have Android 5.1, and it isn't setting up the firewall rules correctly.

Once they are installed, you'll want to set them up. Here I walk through setting up ConnectBot.

Pull up ConnectBot from your app drawer, and set up a new connection by typing in the username, host, and optionally port.

When asked if you want to accept the server's public SSH key, verify the key, then tap "YES"

Enter in your password to connect, and verify that you can successfully connect to the remote SSH server.

Now, disconnect, sending you back to the app's landing screen.



At this point, long-tap the SSH profile you just created, and tap "Edit port forwards".

Tap the menu in the upper-right hand corner of the profile, and tap "Add port forward".

Give the forward a nickname, such as "ProxyDroid".

Tap "Dynamic (SOCKS)" from the list under "Type".

Provide any source port. It must be above 1024, and cannot be currently in use. I find "1984" apropos.

Leave the "Destination" blank, and tap "CREATE PORT FORWARD".
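
For comparison, the same kind of dynamic SOCKS forward can be opened from a desktop shell with the OpenSSH client, which may help when testing the SSH server before configuring the phone. This is only a sketch of a rough equivalent, not what ConnectBot runs internally: the function names are hypothetical, and the account and host in the example are placeholders. The port check mirrors the "above 1024" rule in the steps above.

```shell
# Accept only whole numbers above 1024 and within the valid port range.
valid_source_port() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$1" -gt 1024 ] && [ "$1" -le 65535 ]
}

# Open a dynamic (SOCKS) forward bound to localhost.
# -D sets up the SOCKS listener; -N skips running a remote command.
socks_forward() {
    valid_source_port "$1" || { echo "invalid source port: $1" >&2; return 1; }
    ssh -N -D "127.0.0.1:$1" "$2"
}

# Example (hypothetical account and host):
# socks_forward 1984 user@ssh.example.com
```

The forward stays up for as long as the ssh process runs, much like the ConnectBot session has to stay connected for its port forward to work.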



You have now successfully created a SOCKS listening port on localhost:1984. Next, we need to create software firewall rules on the phone to globally forward all packets through localhost on port 1984, creating our transparent proxy. To do that, pull up ProxyDroid, and I'll walk you through setting it up:

In ProxyDroid, set "127.0.0.1" as the "Host".

Match the port with what you set in ConnectBot's port forward ("1984" in our example).

Set the "Proxy Type" to "SOCKS5"

Scroll to the bottom of the app, and check the checkbox for "Global Proxy".

OPTIONAL: Check the checkbox for "DNS Proxy".

That last step will also tunnel DNS requests through the proxy. Unfortunately, I have found it to be buggy and unstable, so I leave it unchecked, which gives me a stable encrypted SSH proxy experience.

Now that both are configured, connect to your remote SSH server with ConnectBot that you have configured, then enable the proxy by tapping the slider next to "Proxy Switch". You should have a running global SSH proxy from your smartphone to the remote SSH server, where all packets are being sent. You can visit a site that returns your external IP address, such as http://findmyipaddress.com/, to verify that the source IP address of the HTTP request is the same IP address as your SSH server. If so, your packets are being tunneled through your SSH connection.
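
The same check can be scripted from any shell that has curl(1), by sending the request through the SOCKS port directly and comparing the reported address with the SSH server's. Treat this as a sketch under assumptions: the function names are invented for this example, https://ip.example.com/ stands in for whatever plain-text "what is my IP" service you trust, and 203.0.113.7 is a documentation address standing in for your SSH server.

```shell
# Ask an IP echo service for our address, routed through a SOCKS5 proxy
# on localhost. --socks5-hostname also resolves the hostname through
# the proxy instead of locally.
ip_via_socks() {
    curl --silent --max-time 15 --socks5-hostname "127.0.0.1:$1" "$2"
}

# Succeed only when the address the service sees is the expected one.
# Arguments: SOCKS port, echo service URL, expected IP address.
proxy_is_active() {
    seen=$(ip_via_socks "$1" "$2") || return 1
    [ "$seen" = "$3" ]
}

# Example with hypothetical values:
# proxy_is_active 1984 https://ip.example.com/ 203.0.113.7 && echo "tunneled"
```

This assumes the service returns nothing but the bare address; a page that returns HTML, as many do, would need its output parsed first.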

sha256crypt 1-128 characters	sha256crypt 1-512 characters	sha256crypt 1-4,096 characters
sha512crypt 1-128 characters	sha512crypt 1-512 characters	sha512crypt 1-4,096 characters

Frame 1	Frame 2	Difference of frames 1 & 2
Frame 1 maxed luminosity	Frame 2 maxed luminosity	Difference of frames 1 & 2 maxed luminosity

Email	miniLock ID
aaron.toponce@gmail.com	mWdv6o7TxCEFq1uN6Q6xiWiBwMc7wzyzCfMa6tVoEPJ5S
atoponce@xmission.com	qU7DJqG7UzEWYT316wGQHTo2abUZQk6PG8B6fMwZVC9MN
aaron.toponce@utah.edu	22vDEVchYhUbGY9Wi6EdhsS47EUeLKQAVEVat56HK8Riry