Daniel Lemire's articles on arXiv

Daniel Lemire's articles on arXiv 2025-12-12T00:00:00-05:00 http://arxiv.org/a/lemire_d_1 http://arxiv.org/abs/2503.01662v2 2025-03-11T16:47:12-04:00 2025-03-03T10:38:20-05:00 Scanning HTML at Tens of Gigabytes per Second on ARM Processors

Modern processors have instructions to process 16 bytes or more at once. These instructions are called SIMD, for single instruction, multiple data. Recent advances have leveraged SIMD instructions to accelerate parsing of common Internet formats such as JSON and base64. During HTML parsing, they quickly identify specific characters with a strategy called vectorized classification. We review their techniques and compare them with a faster alternative. We measure a 20-fold performance improvement in HTML scanning compared to traditional methods on recent ARM processors. Our findings highlight the potential of SIMD-based algorithms for optimizing Web browser performance.

Daniel Lemire 10.1002/spe.3420 Software: Practice and Experience 55 (7), 2025 http://arxiv.org/abs/2412.16370v1 2024-12-20T17:01:22-05:00 2024-12-20T17:01:22-05:00 Faster Positional-Population Counts for AVX2, AVX-512, and ASIMD

The positional population count operation pospopcnt() counts for an array of w-bit words how often each of the w bits was set. Various applications in bioinformatics, database engineering, and digital processing exist. Building on earlier work by Klarqvist et al., we show how positional population counts can be rapidly computed using SIMD techniques with good performance from the first byte, approaching memory-bound speeds for input arrays of as little as 4 KiB. Improvements include an improved algorithm structure, better handling of unaligned and very short arrays, as well as faster bit-parallel accumulation of intermediate results. We provide a generic algorithm description as well as implementations for various SIMD instruction set extensions, including Intel AVX2, AVX-512, and ARM ASIMD, and discuss the adaption of our algorithm to other platforms.

Robert Clausecker Daniel Lemire Florian Schintke 23 pages, 11 figures. Associated source code can be found on line at https://github.com/clausecker/pospop and https://github.com/lemire/pospopcnt_avx512 http://arxiv.org/abs/2411.12035v2 2024-12-12T01:26:22-05:00 2024-11-18T15:13:05-05:00 Parsing Millions of DNS Records per Second

The Domain Name System (DNS) plays a critical role in the functioning of the Internet. It provides a hierarchical name space for locating resources. Data is typically stored in plain text files, possibly spanning gigabytes. Frequent parsing of these files to refresh the data is computationally expensive: processing a zone file can take minutes. We propose a novel approach called simdzone to enhance DNS parsing throughput. We use data parallelism, specifically the Single Instruction Multiple Data (SIMD) instructions available on commodity processors. We show that we can multiply the parsing speed compared to state-of-the-art parsers found in Knot DNS and the NLnet Labs Name Server Daemon (NSD). The resulting software library replaced the parser in NSD.

Jeroen Koekkoek Daniel Lemire 10.1002/spe.339 software at https://github.com/NLnetLabs/simdzone http://arxiv.org/abs/2408.06213v2 2024-08-20T11:28:12-04:00 2024-08-12T11:07:30-04:00 Batched Ranged Random Integer Generation

Pseudorandom values are often generated as 64-bit binary words. These random words need to be converted into ranged values without statistical bias. We present an efficient algorithm to generate multiple independent uniformly-random bounded integers from a single uniformly-random binary word, without any bias. In the common case, our method uses one multiplication and no division operations per value produced. In practice, our algorithm can more than double the speed of unbiased random shuffling for small to moderately large arrays.

Nevin Brackett-Rozinsky Daniel Lemire 10.1002/spe.3369 software: https://github.com/lemire/batched_random Software: Practice and Experience 55 (1), 2024 http://arxiv.org/abs/2312.17149v3 2024-07-31T22:26:27-04:00 2023-12-28T12:30:25-05:00 On-Demand JSON: A Better Way to Parse Documents?

JSON is a popular standard for data interchange on the Internet. Ingesting JSON documents can be a performance bottleneck. A popular parsing strategy consists in converting the input text into a tree-based data structure -- sometimes called a Document Object Model or DOM. We designed and implemented a novel JSON parsing interface -- called On-Demand -- that appears to the programmer like a conventional DOM-based approach. However, the underlying implementation is a pointer iterating through the content, only materializing the results (objects, arrays, strings, numbers) lazily.On recent commodity processors, an implementation of our approach provides superior performance in multiple benchmarks. To ensure reproducibility, our work is freely available as open source software. Several systems use On-Demand: e.g., Apache Doris, the Node.js JavaScript runtime, Milvus, and Velox.

John Keiser Daniel Lemire 10.1002/spe.3313 Software: Practice and Experience 54 (6), 2024 http://arxiv.org/abs/1902.08318v7 2024-07-23T17:56:05-04:00 2019-02-21T19:24:01-05:00 Parsing Gigabytes of JSON per Second

JavaScript Object Notation or JSON is a ubiquitous data exchange format on the Web. Ingesting JSON documents can become a performance bottleneck due to the sheer volume of data. We are thus motivated to make JSON parsing as fast as possible. Despite the maturity of the problem of JSON parsing, we show that substantial speedups are possible. We present the first standard-compliant JSON parser to process gigabytes of data per second on a single core, using commodity processors. We can use a quarter or fewer instructions than a state-of-the-art reference parser like RapidJSON. Unlike other validating parsers, our software (simdjson) makes extensive use of Single Instruction, Multiple Data (SIMD) instructions. To ensure reproducibility, simdjson is freely available as open-source software under a liberal license.

Geoff Langdale Daniel Lemire 10.1007/s00778-019-00578-5 software: https://github.com/lemire/simdjson The VLDB Journal, 28(6), 2019 http://arxiv.org/abs/2311.10533v2 2023-12-09T10:38:42-05:00 2023-11-17T08:57:25-05:00 Parsing Millions of URLs per Second

URLs are fundamental elements of web applications. By applying vector algorithms, we built a fast standard-compliant C++ implementation. Our parser uses three times fewer instructions than competing parsers following the WHATWG standard (e.g., Servo's rust-url) and up to eight times fewer instructions than the popular curl parser. The Node.js environment adopted our C++ library. In our tests on realistic data, a recent Node.js version (20.0) with our parser is four to five times faster than the last version with the legacy URL parser.

Yagiz Nizipli Daniel Lemire 10.1002/spe.3296 Software: Practice and Experience 54 (5), 2024 http://arxiv.org/abs/2212.05098v4 2023-08-05T13:40:07-04:00 2022-12-09T14:55:19-05:00 Transcoding Unicode Characters with AVX-512 Instructions

Intel includes in its recent processors a powerful set of instructions capable of processing 512-bit registers with a single instruction (AVX-512). Some of these instructions have no equivalent in earlier instruction sets. We leverage these instructions to efficiently transcode strings between the most common formats: UTF-8 and UTF-16. With our novel algorithms, we are often twice as fast as the previous best solutions. For example, we transcode Chinese text from UTF-8 to UTF-16 at more than 5 GiB/s using fewer than 2 CPU instructions per character. To ensure reproducibility, we make our software freely available as an open source library. Our library is part of the popular Node.js JavaScript runtime.

Robert Clausecker Daniel Lemire 10.1002/spe.3261 Software: Practice and Experience 53 (12) 2023 http://arxiv.org/abs/2111.08692v3 2023-05-19T22:00:24-04:00 2021-11-14T18:20:22-05:00 Unicode at Gigabytes per Second

We often represent text using Unicode formats (UTF-8 and UTF-16). The UTF-8 format is increasingly popular, especially on the web (XML, HTML, JSON, Rust, Go, Swift, Ruby). The UTF-16 format is most common in Java, .NET, and inside operating systems such as Windows. Software systems frequently have to convert text from one Unicode format to the other. While recent disks have bandwidths of 5 GiB/s or more, conventional approaches transcode non-ASCII text at a fraction of a gigabyte per second. We show that we can validate and transcode Unicode text at gigabytes per second on current systems (x64 and ARM) without sacrificing safety. Our open-source library can be ten times faster than the popular ICU library on non-ASCII strings and even faster on ASCII strings.

Daniel Lemire 10.1007/978-3-030-86692-1_2 SPIRE 2021: String Processing and Information Retrieval Software: Practice and Experience, Volume52, Issue2 February 2022 http://arxiv.org/abs/2303.14321v1 2023-03-24T21:26:00-04:00 2023-03-24T21:26:00-04:00 Exact Short Products From Truncated Multipliers

We sometimes need to compute the most significant digits of the product of small integers with a multiplier requiring much storage: e.g., a large integer (e.g., $5^{100}$) or an irrational number ($\pi$). We only need to access the most significant digits of the multiplier-as long as the integers are sufficiently small. We provide an efficient algorithm to compute the range of integers given a truncated multiplier and a desired number of digits.

Daniel Lemire 10.1093/comjnl/bxad077 Software at https://github.com/lemire/exactshortlib Computer Journal 67 (4), 2024 http://arxiv.org/abs/2212.06644v2 2023-02-27T22:33:06-05:00 2022-12-13T10:26:46-05:00 Fast Number Parsing Without Fallback

In recent work, Lemire (2021) presented a fast algorithm to convert number strings into binary floating-point numbers. The algorithm has been adopted by several important systems: e.g., it is part of the runtime libraries of GCC 12, Rust 1.55, and Go 1.16. The algorithm parses any number string with a significand containing no more than 19 digits into an IEEE floating-point number. However, there is a check leading to a fallback function to ensure correctness. This fallback function is never called in practice. We prove that the fallback is unnecessary. Thus we can slightly simplify the algorithm and its implementation.

Noble Mushtak Daniel Lemire 10.1002/spe.3198 Software: Practice and Experience 53 (6), 2023 http://arxiv.org/abs/2109.10433v3 2022-11-14T12:17:36-05:00 2021-09-21T16:54:24-04:00 Transcoding Billions of Unicode Characters per Second with SIMD Instructions

In software, text is often represented using Unicode formats (UTF-8 and UTF-16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state-of-the-art disks and networks. These transcoding functions make little use of the single-instruction-multiple-data (SIMD) instructions available on commodity processors. By designing transcoding algorithms for SIMD instructions, we multiply the speed of transcoding on current systems (x64 and ARM). To ensure reproducibility, we make our software freely available as an open source library.

Daniel Lemire Wojciech Muła 10.1002/spe.3036 Software: https://github.com/simdutf/simdutf Software: Practice and Experience, Volume 52, Issue 2 February 2022 Pages 555-575 http://arxiv.org/abs/2101.11408v9 2022-11-03T14:40:26-04:00 2021-01-11T15:31:27-05:00 Number Parsing at a Gigabyte per Second

With disks and networks providing gigabytes per second, parsing decimal numbers from strings becomes a bottleneck. We consider the problem of parsing decimal numbers to the nearest binary floating-point value. The general problem requires variable-precision arithmetic. However, we need at most 17 digits to represent 64-bit standard floating-point numbers (IEEE 754). Thus we can represent the decimal significand with a single 64-bit word. By combining the significand and precomputed tables, we can compute the nearest floating-point number using as few as one or two 64-bit multiplications. Our implementation can be several times faster than conventional functions present in standard C libraries on modern 64-bit systems (Intel, AMD, ARM and POWER9). Our work is available as open source software used by major systems such as Apache Arrow and Yandex ClickHouse. The Go standard library has adopted a version of our approach.

Daniel Lemire 10.1002/spe.2984 Software at https://github.com/fastfloat/fast_float and https://github.com/lemire/simple_fastfloat_benchmark/ Software: Practice and Experience 51 (8), 2021 http://arxiv.org/abs/1709.07821v6 2022-02-07T10:03:28-05:00 2017-09-22T11:56:49-04:00 Roaring Bitmaps: Implementation of an Optimized Software Library

Compressed bitmap indexes are used in systems such as Git or Oracle to accelerate queries. They represent sets and often support operations such as unions, intersections, differences, and symmetric differences. Several important systems such as Elasticsearch, Apache Spark, Netflix's Atlas, LinkedIn's Pinot, Metamarkets' Druid, Pilosa, Apache Hive, Apache Tez, Microsoft Visual Studio Team Services and Apache Kylin rely on a specific type of compressed bitmap index called Roaring. We present an optimized software library written in C implementing Roaring bitmaps: CRoaring. It benefits from several algorithms designed for the single-instruction-multiple-data (SIMD) instructions available on commodity processors. In particular, we present vectorized algorithms to compute the intersection, union, difference and symmetric difference between arrays. We benchmark the library against a wide range of competitive alternatives, identifying weaknesses and strengths in our software. Our work is available under a liberal open-source license.

Daniel Lemire Owen Kaser Nathan Kurz Luca Deri Chris O'Hara François Saint-Jacques Gregory Ssi-Yan-Kai 10.1002/spe.2560 Software: Practice and Experience Volume 48, Issue 4 April 2018 Pages 867-895 http://arxiv.org/abs/2201.01174v1 2022-01-04T10:05:24-05:00 2022-01-04T10:05:24-05:00 Binary Fuse Filters: Fast and Smaller Than Xor Filters

Bloom and cuckoo filters provide fast approximate set membership while using little memory. Engineers use them to avoid expensive disk and network accesses. The recently introduced xor filters can be faster and smaller than Bloom and cuckoo filters. The xor filters are within 23% of the theoretical lower bound in storage as opposed to 44% for Bloom filters. Inspired by Dietzfelbinger and Walzer, we build probabilistic filters -- called binary fuse filters -- that are within 13% of the storage lower bound -- without sacrificing query speed. As an additional benefit, the construction of the new binary fuse filters can be more than twice as fast as the construction of xor filters. By slightly sacrificing query speed, we further reduce storage to within 8% of the lower bound. We compare the performance against a wide range of competitive alternatives such as Bloom filters, blocked Bloom filters, vector quotient filters, cuckoo filters, and the recent ribbon filters. Our experiments suggest that binary fuse filters are superior to xor filters.

Thomas Mueller Graf Daniel Lemire 10.1145/3510449 Journal of Experimental Algorithmics 27, 2022 http://arxiv.org/abs/2010.03090v4 2021-12-15T13:22:05-05:00 2020-10-06T19:40:03-04:00 Validating UTF-8 In Less Than One Instruction Per Byte

The majority of text is stored in UTF-8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF-8 validation routines used in many libraries and languages by more than 10 times using commonly available SIMD instructions. To ensure reproducibility, our work is freely available as open source software.

John Keiser Daniel Lemire 10.1002/spe.2920 Software: Practice and Experience 51 (5), 2021 http://arxiv.org/abs/2012.12369v3 2021-06-29T09:53:43-04:00 2020-12-22T16:44:42-05:00 Integer Division by Constants: Optimal Bounds

The integer division of a numerator n by a divisor d gives a quotient q and a remainder r. Optimizing compilers accelerate software by replacing the division of n by d with the division of c * n (or c * n + c) by m for convenient integers c and m chosen so that they approximate the reciprocal: c/m ~= 1/d. Such techniques are especially advantageous when m is chosen to be a power of two and when d is a constant so that c and m can be precomputed. The literature contains many bounds on the distance between c/m and the divisor d. Some of these bounds are optimally tight, while others are not. We present optimally tight bounds for quotient and remainder computations.

Daniel Lemire Colin Bartlett Owen Kaser 10.1016/j.heliyon.2021.e07442 Heliyon 7 (6), 2021 http://arxiv.org/abs/1911.02696v6 2021-05-11T17:15:54-04:00 2019-11-06T20:06:38-05:00 Efficient Computation of Positional Population Counts Using SIMD Instructions

In several fields such as statistics, machine learning, and bioinformatics, categorical variables are frequently represented as one-hot encoded vectors. For example, given 8 distinct values, we map each value to a byte where only a single bit has been set. We are motivated to quickly compute statistics over such encodings. Given a stream of k-bit words, we seek to compute k distinct sums corresponding to bit values at indexes 0, 1, 2, ..., k-1. If the k-bit words are one-hot encoded then the sums correspond to a frequency histogram. This multiple-sum problem is a generalization of the population-count problem where we seek the sum of all bit values. Accordingly, we refer to the multiple-sum problem as a positional population-count. Using SIMD (Single Instruction, Multiple Data) instructions from recent Intel processors, we describe algorithms for computing the 16-bit position population count using less than half of a CPU cycle per 16-bit word. Our best approach uses up to 400 times fewer instructions and is up to 50 times faster than baseline code using only regular (non-SIMD) instructions, for sufficiently large inputs.

Marcus D. R. Klarqvist Wojciech Muła Daniel Lemire 10.1002/cpe.6304 Concurrency and Computation: Practice and Experience 33 (17), 2021 http://arxiv.org/abs/1209.2137v7 2021-01-30T13:23:38-05:00 2012-09-10T16:08:03-04:00 Decoding billions of integers per second through vectorization

In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and SIMD instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD-BP128 that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the same time, SIMD-BP128 saves up to 2 bits per integer. For even better compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while being two times faster during decoding.

Daniel Lemire Leonid Boytsov 10.1002/spe.2203 For software, see https://github.com/lemire/FastPFor, For data, see http://boytsov.info/datasets/clueweb09gap/ Software: Practice & Experience 45 (1), 2015 http://arxiv.org/abs/cs/0610128v3 2020-04-26T17:39:09-04:00 2006-10-20T20:30:57-04:00 Hierarchical Bin Buffering: Online Local Moments for Dynamic External Memory Arrays

Local moments are used for local regression, to compute statistical measures such as sums, averages, and standard deviations, and to approximate probability distributions. We consider the case where the data source is a very large I/O array of size n and we want to compute the first N local moments, for some constant N. Without precomputation, this requires O(n) time. We develop a sequence of algorithms of increasing sophistication that use precomputation and additional buffer space to speed up queries. The simpler algorithms partition the I/O array into consecutive ranges called bins, and they are applicable not only to local-moment queries, but also to algebraic queries (MAX, AVERAGE, SUM, etc.). With N buffers of size sqrt{n}, time complexity drops to O(sqrt n). A more sophisticated approach uses hierarchical buffering and has a logarithmic time complexity (O(b log_b n)), when using N hierarchical buffers of size n/b. Using Overlapped Bin Buffering, we show that only a single buffer is needed, as with wavelet-based algorithms, but using much less storage. Applications exist in multidimensional and statistical databases over massive data sets, interactive image processing, and visualization.

Daniel Lemire Owen Kaser 10.1145/1328911.1328925 ACM Transactions on Algorithms 4(1): 14 (2008) http://arxiv.org/abs/1401.6399v13 2020-04-20T15:42:24-04:00 2014-01-24T11:53:37-05:00 SIMD Compression and the Intersection of Sorted Integers

Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded integer while still providing state-of-the-art compression. However, if the subsequent processing of the integers is slow, the effort spent on optimizing decoding speed can be wasted. To show that it does not have to be so, we (1) vectorize and optimize the intersection of posting lists; (2) introduce the SIMD Galloping algorithm. We exploit the fact that one SIMD instruction can compare 4 pairs of integers at once. We experiment with two TREC text collections, GOV2 and ClueWeb09 (Category B), using logs from the TREC million-query track. We show that using only the SIMD instructions ubiquitous in all modern CPUs, our techniques for conjunctive queries can double the speed of a state-of-the-art approach.

Daniel Lemire Leonid Boytsov Nathan Kurz 10.1002/spe.2326 Software: Practice and Experience Volume 46, Issue 6, pages 723-749, June 2016 http://arxiv.org/abs/1912.08258v3 2020-01-27T13:27:49-05:00 2019-12-17T15:07:53-05:00 Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters

The Bloom filter provides fast approximate set membership while using little memory. Engineers often use these filters to avoid slow operations such as disk or network accesses. As an alternative, a cuckoo filter may need less space than a Bloom filter and it is faster. Chazelle et al. proposed a generalization of the Bloom filter called the Bloomier filter. Dietzfelbinger and Pagh described a variation on the Bloomier filter that can be used effectively for approximate membership queries. It has never been tested empirically, to our knowledge. We review an efficient implementation of their approach, which we call the xor filter. We find that xor filters can be faster than Bloom and cuckoo filters while using less memory. We further show that a more compact version of xor filters (xor+) can use even less space than highly compact alternatives (e.g., Golomb-compressed sequences) while providing speeds competitive with Bloom filters.

Thomas Mueller Graf Daniel Lemire 10.1145/3376122 Journal of Experimental Algorithmics 25 (1), 2020 http://arxiv.org/abs/1902.01961v3 2019-11-20T11:52:47-05:00 2019-02-05T17:33:20-05:00 Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries

On common processors, integer multiplication is many times faster than integer division. Dividing a numerator n by a divisor d is mathematically equivalent to multiplication by the inverse of the divisor (n / d = n x 1/d). If the divisor is known in advance---or if repeated integer divisions will be performed with the same divisor---it can be beneficial to substitute a less costly multiplication for an expensive division. Currently, the remainder of the division by a constant is computed from the quotient by a multiplication and a subtraction. But if just the remainder is desired and the quotient is unneeded, this may be suboptimal. We present a generally applicable algorithm to compute the remainder more directly. Specifically, we use the fractional portion of the product of the numerator and the inverse of the divisor. On this basis, we also present a new, simpler divisibility algorithm to detect nonzero remainders. We also derive new tight bounds on the precision required when representing the inverse of the divisor. Furthermore, we present simple C implementations that beat the optimized code produced by state-of-art C compilers on recent x64 processors (e.g., Intel Skylake and AMD Ryzen), sometimes by more than 25%. On all tested platforms including 64-bit ARM and POWER8, our divisibility-test functions are faster than state-of-the-art Granlund-Montgomery divisibility-test functions, sometimes by more than 50%.

Daniel Lemire Owen Kaser Nathan Kurz 10.1002/spe.2689 Software: Practice and Experience 49 (6), 2019 http://arxiv.org/abs/1910.05109v1 2019-10-02T11:39:28-04:00 2019-10-02T11:39:28-04:00 Base64 encoding and decoding at almost the speed of a memory copy

Many common document formats on the Internet are text-only such as email (MIME) and the Web (HTML, JavaScript, JSON and XML). To include images or executable code in these documents, we first encode them as text using base64. Standard base64 encoding uses 64~ASCII characters: both lower and upper case Latin letters, digits and two other symbols. We show how we can encode and decode base64 data at nearly the speed of a memory copy (memcpy) on recent Intel processors, as long as the data does not fit in the first-level (L1) cache. We use the SIMD (Single Instruction Multiple Data) instruction set AVX-512 available on commodity processors. Our implementation generates several times fewer instructions than previous SIMD-accelerated base64 codecs. It is also more versatile, as it can be adapted---even at runtime---to any base64 variant by only changing constants.

Wojciech Muła Daniel Lemire 10.1002/spe.2777 Software: Practice and Experience 50 (2), 2020 http://arxiv.org/abs/1805.10941v4 2018-12-28T11:18:11-05:00 2018-05-28T10:28:04-04:00 Fast Random Integer Generation in an Interval

In simulations, probabilistic algorithms and statistical tests, we often generate random integers in an interval (e.g., [0,s)). For example, random integers in an interval are essential to the Fisher-Yates random shuffle. Consequently, popular languages like Java, Python, C++, Swift and Go include ranged random integer generation functions as part of their runtime libraries. Pseudo-random values are usually generated in words of a fixed number of bits (e.g., 32 bits, 64 bits) using algorithms such as a linear congruential generator. We need functions to convert such random words to random integers in an interval ([0,s)) without introducing statistical biases. The standard functions in programming languages such as Java involve integer divisions. Unfortunately, division instructions are relatively expensive. We review an unbiased function to generate ranged integers from a source of random words that avoids integer divisions with high probability. To establish the practical usefulness of the approach, we show that this algorithm can multiply the speed of unbiased random shuffling on x64 processors. Our proposed approach has been adopted by the Go language for its implementation of the shuffle function.

Daniel Lemire 10.1145/3230636 to appear in ACM Transactions on Modeling and Computer Simulation ACM Transactions on Modeling and Computer Simulation 29 (1), 2019 http://arxiv.org/abs/1810.05313v2 2018-10-25T21:30:24-04:00 2018-10-11T21:40:57-04:00 Xorshift1024*, Xorshift1024+, Xorshift128+ and Xoroshiro128+ Fail Statistical Tests for Linearity

L'Ecuyer & Simard's Big Crush statistical test suite has revealed statistical flaws in many popular random number generators including Marsaglia's Xorshift generators. Vigna recently proposed some 64-bit variations on the Xorshift scheme that are further scrambled (i.e., Xorshift1024*, Xorshift1024+, Xorshift128+, Xoroshiro128+). Unlike their unscrambled counterparts, they pass Big Crush when interleaving blocks of 32 bits for each 64-bit word (most significant, least significant, most significant, least significant, etc.). We report that these scrambled generators systematically fail Big Crush---specifically the linear-complexity and matrix-rank tests that detect linearity---when taking the 32 lowest-order bits in reverse order from each 64-bit word.

Daniel Lemire Melissa E. O'Neill 10.1016/j.cam.2018.10.019 Computational and Applied Mathematics 350, 2019 http://arxiv.org/abs/1202.4961v6 2018-09-20T15:33:55-04:00 2012-02-22T11:34:24-05:00 Strongly universal string hashing is fast

We present fast strongly universal string hashing families: they can process data at a rate of 0.2 CPU cycle per byte. Maybe surprisingly, we find that these families---though they require a large buffer of random numbers---are often faster than popular hash functions with weaker theoretical guarantees. Moreover, conventional wisdom is that hash functions with fewer multiplications are faster. Yet we find that they may fail to be faster due to operation pipelining. We present experimental results on several processors including low-powered processors. Our tests include hash functions designed for processors with the Carry-Less Multiplication (CLMUL) instruction set. We also prove, using accessible proofs, the strong universality of our families.

Owen Kaser Daniel Lemire 10.1093/comjnl/bxt070 Software is available at http://code.google.com/p/variablelengthstringhashing/ and https://github.com/lemire/StronglyUniversalStringHashing Computer Journal (2014) 57 (11): 1624-1638 http://arxiv.org/abs/1611.07612v9 2018-09-05T15:35:03-04:00 2016-11-22T21:47:50-05:00 Faster Population Counts Using AVX2 Instructions

Counting the number of ones in a binary stream is a common operation in database, information-retrieval, cryptographic and machine-learning applications. Most processors have dedicated instructions to count the number of ones in a word (e.g., popcnt on x64 processors). Maybe surprisingly, we show that a vectorized approach using SIMD instructions can be twice as fast as using the dedicated instructions on recent Intel processors. The benefits can be even greater for applications such as similarity measures (e.g., the Jaccard index) that require additional Boolean operations. Our approach has been adopted by LLVM: it is used by its popular C compiler (clang).

Wojciech Muła Nathan Kurz Daniel Lemire 10.1093/comjnl/bxx046 Software is at https://github.com/CountOnes/hamming_weight Computer Journal, Volume 61, Issue 1, 1 January 2018 http://arxiv.org/abs/1704.00605v5 2018-06-14T17:02:13-04:00 2017-03-30T15:04:09-04:00 Faster Base64 Encoding and Decoding Using AVX2 Instructions

Web developers use base64 formats to include images, fonts, sounds and other resources directly inside HTML, JavaScript, JSON and XML files. We estimate that billions of base64 messages are decoded every day. We are motivated to improve the efficiency of base64 encoding and decoding. Compared to state-of-the-art implementations, we multiply the speeds of both the encoding (~10x) and the decoding (~7x). We achieve these good results by using the single-instruction-multiple-data (SIMD) instructions available on recent Intel processors (AVX2). Our accelerated software abides by the specification and reports errors when encountering characters outside of the base64 set. It is available online as free software under a liberal license.

Wojciech Muła Daniel Lemire 10.1145/3132709 software at https://github.com/lemire/fastbase64 ACM Transactions on the Web 12 (3), 2018 http://arxiv.org/abs/1603.06549v4 2018-03-02T13:35:46-05:00 2016-03-21T15:30:53-04:00 Consistently faster and smaller compressed bitmaps with Roaring

Compressed bitmap indexes are used in databases and search engines. Many bitmap compression techniques have been proposed, almost all relying primarily on run-length encoding (RLE). However, on unsorted data, we can get superior performance with a hybrid compression technique that uses both uncompressed bitmaps and packed arrays inside a two-level tree. An instance of this technique, Roaring, has recently been proposed. Due to its good performance, it has been adopted by several production platforms (e.g., Apache Lucene, Apache Spark, Apache Kylin and Druid). Yet there are cases where run-length encoded bitmaps are smaller than the original Roaring bitmaps---typically when the data is sorted so that the bitmaps contain long compressible runs. To better handle these cases, we build a new Roaring hybrid that combines uncompressed bitmaps, packed arrays and RLE compressed segments. The result is a new Roaring format that compresses better. Overall, our new implementation of Roaring can be several times faster (up to two orders of magnitude) than the implementations of traditional RLE-based alternatives (WAH, Concise, EWAH) while compressing better. We review the design choices and optimizations that make these good results possible.

Daniel Lemire Gregory Ssi-Yan-Kai Owen Kaser 10.1002/spe.2402 Software: Practice and Experience Volume 46, Issue 11, pages 1547-1569, November 2016 http://arxiv.org/abs/1802.10233v1 2018-02-27T21:10:36-05:00 2018-02-27T21:10:36-05:00 Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for the new types of data sources, query languages, and approaches to query processing and optimization.

Edmon Begoli Jesús Camacho Rodríguez Julian Hyde Michael J. Mior Daniel Lemire 10.1145/3183713.3190662 SIGMOD'18 http://arxiv.org/abs/1709.08990v2 2017-09-27T15:53:20-04:00 2017-09-25T10:45:20-04:00 Stream VByte: Faster Byte-Oriented Integer Compression

Arrays of integers are often compressed in search engines. Though there are many ways to compress integers, we are interested in the popular byte-oriented integer compression techniques (e.g., VByte or Google's Varint-GB). They are appealing due to their simplicity and engineering convenience. Amazon's varint-G8IU is one of the fastest byte-oriented compression technique published so far. It makes judicious use of the powerful single-instruction-multiple-data (SIMD) instructions available in commodity processors. To surpass varint-G8IU, we present Stream VByte, a novel byte-oriented compression technique that separates the control stream from the encoded data. Like varint-G8IU, Stream VByte is well suited for SIMD instructions. We show that Stream VByte decoding can be up to twice as fast as varint-G8IU decoding over real data sets. In this sense, Stream VByte establishes new speed records for byte-oriented integer compression, at times exceeding the speed of the memcpy function. On a 3.4GHz Haswell processor, it decodes more than 4 billion differentially-coded integers per second from RAM to L1 cache.

Daniel Lemire Nathan Kurz Christoph Rupp 10.1016/j.ipl.2017.09.011 Information Processing Letters 130, February 2018, Pages 1-6 http://arxiv.org/abs/1703.08198v2 2017-04-06T16:08:04-04:00 2017-03-23T14:29:03-04:00 On Desirable Semantics of Functional Dependencies over Databases with Incomplete Information

Codd's relational model describes just one possible world. To better cope with incomplete information, extended database models allow several possible worlds. Vague tables are one such convenient extended model where attributes accept sets of possible values (e.g., the manager is either Jill or Bob). However, conceptual database design in such cases remains an open problem. In particular, there is no canonical definition of functional dependencies (FDs) over possible worlds (e.g., each employee has just one manager). We identify several desirable properties that the semantics of such FDs should meet including Armstrong's axioms, the independence from irrelevant attributes, seamless satisfaction and implied by strong satisfaction. We show that we can define FDs such that they have all our desirable properties over vague tables. However, we also show that no notion of FD can satisfy all our desirable properties over a more general model (disjunctive tables). Our work formalizes a trade-off between having a general model and having well-behaved FDs.

Antonio Badia Daniel Lemire 10.3233/FI-2018-1651 to appear in Fundamenta Informaticae Fundamenta Informaticae 158 (2018) 327-352 http://arxiv.org/abs/1503.07387v4 2017-01-13T23:05:36-05:00 2015-02-20T09:52:06-05:00 Vectorized VByte Decoding

We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (Masked VByte) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed.

Jeff Plaisance Nathan Kurz Daniel Lemire First International Symposium on Web Algorithms (June 2015) http://arxiv.org/abs/1611.05428v2 2017-01-09T10:40:05-05:00 2016-11-16T15:17:07-05:00 Upscaledb: Efficient Integer-Key Compression in a Key-Value Store using SIMD Instructions

Compression can sometimes improve performance by making more of the data available to the processors faster. We consider the compression of integer keys in a B+-tree index. For this purpose, systems such as IBM DB2 use variable-byte compression over differentially coded keys. We revisit this problem with various compression alternatives such as Google's VarIntGB, Binary Packing and Frame-of-Reference. In all cases, we describe algorithms that can operate directly on compressed data. Many of our alternatives exploit the single-instruction-multiple-data (SIMD) instructions supported by modern CPUs. We evaluate our techniques in a database environment provided by Upscaledb, a production-quality key-value database. Our best techniques are SIMD accelerated: they simultaneously reduce memory usage while improving single-threaded speeds. In particular, a differentially coded SIMD binary-packing techniques (BP128) can offer a superior query speed (e.g., 40% better than an uncompressed database) while providing the best compression (e.g., by a factor of ten). For analytic workloads, our fast compression techniques offer compelling benefits. Our software is available as open source.

Daniel Lemire Christoph Rupp 10.1016/j.is.2017.01.002 Information Systems Volume 66, June 2017, Pages 13-23 http://arxiv.org/abs/1402.4073v2 2016-11-14T20:59:54-05:00 2014-02-17T12:22:05-05:00 Threshold and Symmetric Functions over Bitmaps

Bitmap indexes are routinely used to speed up simple aggregate queries in databases. Set operations such as intersections, unions and complements can be represented as logical operations (AND, OR, NOT). However, less is known about the application of bitmap indexes to more advanced queries. We want to extend the applicability of bitmap indexes. As a starting point, we consider symmetric Boolean queries (e.g., threshold functions). For example, we might consider stores as sets of products, and ask for products that are on sale in 2 to 10 stores. Such symmetric Boolean queries generalize intersection, union, and T-occurrence queries. It may not be immediately obvious to an engineer how to use bitmap indexes for symmetric Boolean queries. Yet, maybe surprisingly, we find that the best of our bitmap-based algorithms are competitive with the state-of-the-art algorithms for important special cases (e.g., MergeOpt, MergeSkip, DivideSkip, ScanCount). Moreover, unlike the competing algorithms, the result of our computation is again a bitmap which can be further processed within a bitmap index. We review algorithmic design issues such as the aggregation of many compressed bitmaps. We conclude with a discussion on other advanced queries that bitmap indexes might be able to support efficiently.

Owen Kaser Daniel Lemire This paper uses small fonts and colours and is only intended for electronic viewing http://arxiv.org/abs/1609.09840v2 2016-10-18T14:54:11-04:00 2016-09-30T14:01:25-04:00 Regular and almost universal hashing: an efficient implementation

Random hashing can provide guarantees regarding the performance of data structures such as hash tables---even in an adversarial setting. Many existing families of hash functions are universal: given two data objects, the probability that they have the same hash value is low given that we pick hash functions at random. However, universality fails to ensure that all hash functions are well behaved. We further require regularity: when picking data objects at random they should have a low probability of having the same hash value, for any fixed hash function. We present the efficient implementation of a family of non-cryptographic hash functions (PM+) offering good running times, good memory usage as well as distinguishing theoretical guarantees: almost universality and component-wise regularity. On a variety of platforms, our implementations are comparable to the state of the art in performance. On recent Intel processors, PM+ achieves a speed of 4.7 bytes per cycle for 32-bit outputs and 3.3 bytes per cycle for 64-bit outputs. We review vectorization through SIMD instructions (e.g., AVX2) and optimizations for superscalar execution.

Dmytro Ivanchykhin Sergey Ignatchenko Daniel Lemire 10.1002/spe.2461 accepted for publication in Software: Practice and Experience in September 2016 Software: Practice and Experience 47 (10), 2017 http://arxiv.org/abs/1501.01941v3 2016-09-21T14:37:56-04:00 2015-01-08T15:04:46-05:00 Bloofi: Multidimensional Bloom Filters

Bloom filters are probabilistic data structures commonly used for approximate membership problems in many areas of Computer Science (networking, distributed systems, databases, etc.). With the increase in data size and distribution of data, problems arise where a large number of Bloom filters are available, and all them need to be searched for potential matches. As an example, in a federated cloud environment, each cloud provider could encode the information using Bloom filters and share the Bloom filters with a central coordinator. The problem of interest is not only whether a given element is in any of the sets represented by the Bloom filters, but which of the existing sets contain the given element. This problem cannot be solved by just constructing a Bloom filter on the union of all the sets. Instead, we effectively have a multidimensional Bloom filter problem: given an element, we wish to receive a list of candidate sets where the element might be. To solve this problem, we consider 3 alternatives. Firstly, we can naively check many Bloom filters. Secondly, we propose to organize the Bloom filters in a hierarchical index structure akin to a B+ tree, that we call Bloofi. Finally, we propose another data structure that packs the Bloom filters in such a way as to exploit bit-level parallelism, which we call Flat-Bloofi. Our theoretical and experimental results show that Bloofi and Flat-Bloofi provide scalable and efficient solutions alternatives to search through a large number of Bloom filters.

Adina Crainiceanu Daniel Lemire 10.1016/j.is.2015.01.002 Information Systems Volume 54, December 2015, Pages 311-324 http://arxiv.org/abs/0707.1913v3 2016-08-22T17:02:58-04:00 2007-07-12T22:30:10-04:00 Removing Manually-Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books

Collaborative work on unstructured or semi-structured documents, such as in literature corpora or source code, often involves agreed upon templates containing metadata. These templates are not consistent across users and over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to identify automatically a large fraction of the templates, thus reducing the burden on the programmers. We investigate the case of the Project Gutenberg corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.

Owen Kaser Daniel Lemire short version appeared in CASCON 2007 proceedings, available from http://portal.acm.org/citation.cfm?id=1321246 Source code at https://github.com/lemire/gutenberg-headers http://arxiv.org/abs/0901.3751v7 2016-07-29T16:24:07-04:00 2009-01-23T14:01:06-05:00 Sorting improves word-aligned bitmap indexes

Bitmap indexes must be compressed to reduce input/output costs and minimize CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. These techniques are sensitive to the order of the rows: a simple lexicographical sort can divide the index size by 9 and make indexes several times faster. We investigate row-reordering heuristics. Simply permuting the columns of the table can increase the sorting efficiency by 40%. Secondary contributions include efficient algorithms to construct and aggregate bitmaps. The effect of word length is also reviewed by constructing 16-bit, 32-bit and 64-bit indexes. Using 64-bit CPUs, we find that 64-bit indexes are slightly faster than 32-bit indexes despite being nearly twice as large.

Daniel Lemire Owen Kaser Kamel Aouiche 10.1016/j.datak.2009.08.006 Data & Knowledge Engineering, Volume 69, Issue 1, 2010, Pages 3-28 http://arxiv.org/abs/0705.4676v8 2016-06-06T11:18:03-04:00 2007-05-31T14:41:28-04:00 Recursive n-gram hashing is pairwise independent, at best

Many applications use sequences of n consecutive symbols (n-grams). Hashing these n-grams can be a performance bottleneck. For more speed, recursive hash families compute hash values by updating previous values. We prove that recursive hash families cannot be more than pairwise independent. While hashing by irreducible polynomials is pairwise independent, our implementations either run in time O(n) or use an exponential amount of memory. As a more scalable alternative, we make hashing by cyclic polynomials pairwise independent by ignoring n-1 bits. Experimentally, we show that hashing by cyclic polynomials is is twice as fast as hashing by irreducible polynomials. We also show that randomized Karp-Rabin hash families are not pairwise independent.

Daniel Lemire Owen Kaser 10.1016/j.csl.2009.12.001 See software at https://github.com/lemire/rollinghashcpp Computer Speech & Language 24(4): 698-710 (2010) http://arxiv.org/abs/0905.2657v2 2016-03-15T17:52:24-04:00 2009-05-16T01:57:06-04:00 Web 2.0 OLAP: From Data Cubes to Tag Clouds

Increasingly, business projects are ephemeral. New Business Intelligence tools must support ad-lib data sources and quick perusal. Meanwhile, tag clouds are a popular community-driven visualization technique. Hence, we investigate tag-cloud views with support for OLAP operations such as roll-ups, slices, dices, clustering, and drill-downs. As a case study, we implemented an application where users can upload data and immediately navigate through its ad hoc dimensions. To support social networking, views can be easily shared and embedded in other Web sites. Algorithmically, our tag-cloud views are approximate range top-k queries over spontaneous data cubes. We present experimental evidence that iceberg cuboids provide adequate online approximations. We benchmark several browser-oblivious tag-cloud layout optimizations.

Kamel Aouiche Daniel Lemire Robert Godin 10.1007/978-3-642-01344-7_5 Software at https://github.com/lemire/OLAPTagCloud. arXiv admin note: substantial text overlap with arXiv:0710.2156 Lecture Notes in Business Information Processing Vol. 18, pages 51-64, 2009 http://arxiv.org/abs/0710.2156v3 2016-03-15T17:52:12-04:00 2007-10-11T15:48:10-04:00 Collaborative OLAP with Tag Clouds: Web 2.0 OLAP Formalism and Experimental Evaluation

Kamel Aouiche Daniel Lemire Robert Godin Software at https://github.com/lemire/OLAPTagCloud http://arxiv.org/abs/1402.6407v10 2016-03-15T15:31:53-04:00 2014-02-25T23:38:22-05:00 Better bitmap performance with Roaring bitmaps

Bitmap indexes are commonly used in databases and search engines. By exploiting bit-level parallelism, they can significantly accelerate queries. However, they can use much memory, and thus we might prefer compressed bitmap indexes. Following Oracle's lead, bitmaps are often compressed using run-length encoding (RLE). Building on prior work, we introduce the Roaring compressed bitmap format: it uses packed arrays for compression instead of RLE. We compare it to two high-performance RLE-based bitmap encoding techniques: WAH (Word Aligned Hybrid compression scheme) and Concise (Compressed `n' Composable Integer Set). On synthetic and real data, we find that Roaring bitmaps (1) often compress significantly better (e.g., 2 times) and (2) are faster than the compressed alternatives (up to 900 times faster for intersections). Our results challenge the view that RLE-based bitmap compression is best.

Samy Chambi Daniel Lemire Owen Kaser Robert Godin 10.1002/spe.2325 Software: Practice and Experience Volume 46, Issue 5, pages 709-719, May 2016 http://arxiv.org/abs/1503.03465v8 2015-11-04T11:34:39-05:00 2015-03-11T15:47:09-04:00 Faster 64-bit universal hashing using carry-less multiplications

Intel and AMD support the Carry-less Multiplication (CLMUL) instruction set in their x64 processors. We use CLMUL to implement an almost universal 64-bit hash family (CLHASH). We compare this new family with what might be the fastest almost universal family on x64 processors (VHASH). We find that CLHASH is at least 60% faster. We also compare CLHASH with a popular hash function designed for speed (Google's CityHash). We find that CLHASH is 40% faster than CityHash on inputs larger than 64 bytes and just as fast otherwise.

Daniel Lemire Owen Kaser 10.1007/s13389-015-0110-5 Journal of Cryptographic Engineering, Volume 6, Issue 3, pp 171-185, 2016 http://arxiv.org/abs/1502.01916v1 2015-02-06T10:11:47-05:00 2015-02-06T10:11:47-05:00 A General SIMD-based Approach to Accelerating Compression Algorithms

Compression algorithms are important for data oriented tasks, especially in the era of Big Data. Modern processors equipped with powerful SIMD instruction sets, provide us an opportunity for achieving better compression performance. Previous research has shown that SIMD-based optimizations can multiply decoding speeds. Following these pioneering studies, we propose a general approach to accelerate compression algorithms. By instantiating the approach, we have developed several novel integer compression algorithms, called Group-Simple, Group-Scheme, Group-AFOR, and Group-PFD, and implemented their corresponding vectorized versions. We evaluate the proposed algorithms on two public TREC datasets, a Wikipedia dataset and a Twitter dataset. With competitive compression ratios and encoding speeds, our SIMD-based algorithms outperform state-of-the-art non-vectorized algorithms with respect to decoding speeds.

Wayne Xin Zhao Xudong Zhang Daniel Lemire Dongdong Shan Jian-Yun Nie Hongfei Yan Ji-Rong Wen 10.1145/2735629 ACM Trans. Inf. Syst. 33, 3, Article 15 (March 2015) http://arxiv.org/abs/1501.06587v1 2015-01-26T16:06:02-05:00 2015-01-26T16:06:02-05:00 Measuring academic influence: Not all citations are equal

The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness of a variety of features for determining the academic influence of a citation. By asking authors to identify the key references in their own work, we created a data set in which citations were labeled according to their academic influence. Using automatic feature selection with supervised machine learning, we found a model for predicting academic influence that achieves good performance on this data set using only four features. The best features, among those we evaluated, were those based on the number of times a reference is mentioned in the body of a citing paper. The performance of these features inspired us to design an influence-primed h-index (the hip-index). Unlike the conventional h-index, it weights citations by how many times a reference is mentioned. According to our experiments, the hip-index is a better indicator of researcher performance than the conventional h-index.

Xiaodan Zhu Peter Turney Daniel Lemire André Vellino 10.1002/asi.23179 Journal of the Association for Information Science and Technology, 66: 408-427 http://arxiv.org/abs/1402.4466v3 2014-08-15T09:45:47-04:00 2014-02-18T15:41:33-05:00 Compressed bitmap indexes: beyond unions and intersections

Compressed bitmap indexes are used to speed up simple aggregate queries in databases. Indeed, set operations like intersections, unions and complements can be represented as logical operations (AND,OR,NOT) that are ideally suited for bitmaps. However, it is less obvious how to apply bitmaps to more advanced queries. For example, we might seek products in a store that meet some, but maybe not all, criteria. Such threshold queries generalize intersections and unions; they are often used in information-retrieval and data-mining applications. We introduce new algorithms that are sometimes three orders of magnitude faster than a naive approach. Our work shows that bitmap indexes are more broadly applicable than is commonly believed.

Owen Kaser Daniel Lemire 10.1002/spe.2289 Accepted for publication in Software: Practice and Experience on August 14th 2014. Note that arXiv:1402.4073 [cs:DB] is a companion to this paper; while they share some text, each contains many results not in the other Software: Practice & Experience 46 (2), 2016 http://arxiv.org/abs/1006.3726v4 2014-07-23T15:18:35-04:00 2010-06-18T11:38:04-04:00 Diamond Dicing

In OLAP, analysts often select an interesting sample of the data. For example, an analyst might focus on products bringing revenues of at least 100 000 dollars, or on shops having sales greater than 400 000 dollars. However, current systems do not allow the application of both of these thresholds simultaneously, selecting products and shops satisfying both thresholds. For such purposes, we introduce the diamond cube operator, filling a gap among existing data warehouse operations. Because of the interaction between dimensions the computation of diamond cubes is challenging. We compare and test various algorithms on large data sets of more than 100 million facts. We find that while it is possible to implement diamonds in SQL, it is inefficient. Indeed, our custom implementation can be a hundred times faster than popular database engines (including a row-store and a column-store).

Hazel Webb Daniel Lemire Owen Kaser 10.1016/j.datak.2013.01.001 29 pages Data & Knowledge Engineering, Volume 86, July 2013, Pages 1-18 http://arxiv.org/abs/1404.4963v2 2014-05-15T10:54:17-04:00 2014-04-19T11:46:01-04:00 Functional dependencies with null markers

Functional dependencies are an integral part of database design. However, they are only defined when we exclude null markers. Yet we commonly use null markers in practice. To bridge this gap between theory and practice, researchers have proposed definitions of functional dependencies over relations with null markers. Though sound, these definitions lack some qualities that we find desirable. For example, some fail to satisfy Armstrong's axioms---while these axioms are part of the foundation of common database methodologies. We propose a set of properties that any extension of functional dependencies over relations with null markers should possess. We then propose two new extensions having these properties. These extensions attempt to allow null markers where they make sense to practitioners. They both support Armstrong's axioms and provide realizable null markers: at any time, some or all of the null markers can be replaced by actual values without causing an anomaly. Our proposals may improve database designs.

Antonio Badia Daniel Lemire 10.1093/comjnl/bxu039 accepted at the Computer Journal (April 2014) Computer Journal 58 (5), 2015 http://arxiv.org/abs/cs/0610010v4 2014-02-04T11:15:57-05:00 2006-10-03T14:04:22-04:00 One-Pass, One-Hash n-Gram Statistics Estimation

In multimedia, text or bioinformatics databases, applications query sequences of n consecutive symbols called n-grams. Estimating the number of distinct n-grams is a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire an unassuming algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashing the data, which is prohibitive for large data sources. We prove that a one-pass one-hash algorithm is sufficient for accurate estimates if the hashing is sufficiently independent. To reduce costs further, we investigate recursive random hashing algorithms and show that they are sufficiently independent in practice. We compare our running times with exact counts using suffix arrays and show that, while we use hardly any storage, we are an order of magnitude faster. The approach further is extended to a one-pass/one-hash computation of n-gram entropy and iceberg counts. The experiments use a large collection of English text from the Gutenberg Project as well as synthetic data.

Daniel Lemire Owen Kaser Fixed a typo http://arxiv.org/abs/1207.2189v3 2014-02-03T13:46:09-05:00 2012-07-09T17:47:13-04:00 Reordering Rows for Better Compression: Beyond the Lexicographic Order

Sorting database tables before compressing them improves the compression rate. Can we do better than the lexicographical order? For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row-ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression. A new heuristic, Multiple Lists, which is a variant on Nearest Neighbor that trades off compression for a major running-time speedup, is a good option for very large tables. However, for some compression schemes, it is more important to generate long runs rather than few runs. For this case, another novel heuristic, Vortex, is promising. We find that we can improve run-length encoding up to a factor of 3 whereas we can improve prefix coding by up to 80%: these gains are on top of the gains due to lexicographically sorting the table. We prove that the new row reordering is optimal (within 10%) at minimizing the runs of identical values within columns, in a few cases.

Daniel Lemire Owen Kaser Eduardo Gutarra 10.1145/2338626.2338627 to appear in ACM TODS ACM Trans. Database Syst. 37, 3, Article 20 (September 2012) http://arxiv.org/abs/1105.6001v4 2012-10-02T12:56:04-04:00 2011-05-30T10:07:10-04:00 A Call to Arms: Revisiting Database Design

Good database design is crucial to obtain a sound, consistent database, and - in turn - good database design methodologies are the best way to achieve the right design. These methodologies are taught to most Computer Science undergraduates, as part of any Introduction to Database class. They can be considered part of the "canon", and indeed, the overall approach to database design has been unchanged for years. Moreover, none of the major database research assessments identify database design as a strategic research direction. Should we conclude that database design is a solved problem? Our thesis is that database design remains a critical unsolved problem. Hence, it should be the subject of more research. Our starting point is the observation that traditional database design is not used in practice - and if it were used it would result in designs that are not well adapted to current environments. In short, database design has failed to keep up with the times. In this paper, we put forth arguments to support our viewpoint, analyze the root causes of this situation and suggest some avenues of research.

Antonio Badia Daniel Lemire 10.1145/2070736.2070750 Removed spurious column break. Nothing else was changed Antonio Badia and Daniel Lemire. A call to arms: revisiting database design. SIGMOD Record 40 (3), pages 61-69, 2011 http://arxiv.org/abs/1010.1526v6 2012-07-02T16:57:01-04:00 2010-10-07T15:48:23-04:00 Time Series Classification by Class-Specific Mahalanobis Distance Measures

To classify time series by nearest neighbors, we need to specify or learn one or several distance measures. We consider variations of the Mahalanobis distance measures which rely on the inverse covariance matrix of the data. Unfortunately --- for time series data --- the covariance matrix has often low rank. To alleviate this problem we can either use a pseudoinverse, covariance shrinking or limit the matrix to its diagonal. We review these alternatives and benchmark them against competitive methods such as the related Large Margin Nearest Neighbor Classification (LMNN) and the Dynamic Time Warping (DTW) distance. As we expected, we find that the DTW is superior, but the Mahalanobis distance measures are one to two orders of magnitude faster. To get best results with Mahalanobis distance measures, we recommend learning one distance measure per class using either covariance shrinking or the diagonal approach.

Zoltán Prekopcsák Daniel Lemire 10.1007/s11634-012-0110-6 Advances in Data Analysis and Classification 6 (3), 2012 http://arxiv.org/abs/1008.1715v5 2011-11-24T10:10:52-05:00 2010-08-10T10:05:28-04:00 The universality of iterated hashing over variable-length strings

Iterated hash functions process strings recursively, one character at a time. At each iteration, they compute a new hash value from the preceding hash value and the next character. We prove that iterated hashing can be pairwise independent, but never 3-wise independent. We show that it can be almost universal over strings much longer than the number of hash values; we bound the maximal string length given the collision probability.

Daniel Lemire 10.1016/j.dam.2011.11.009 Discrete Applied Mathematics 160 (4-5), 604--617 (2012) http://arxiv.org/abs/1108.4041v2 2011-08-22T22:21:59-04:00 2011-08-19T16:16:02-04:00 Extracting, Transforming and Archiving Scientific Data

It is becoming common to archive research datasets that are not only large but also numerous. In addition, their corresponding metadata and the software required to analyse or display them need to be archived. Yet the manual curation of research data can be difficult and expensive, particularly in very large digital repositories, hence the importance of models and tools for automating digital curation tasks. The automation of these tasks faces three major challenges: (1) research data and data sources are highly heterogeneous, (2) future research needs are difficult to anticipate, (3) data is hard to index. To address these problems, we propose the Extract, Transform and Archive (ETA) model for managing and mechanizing the curation of research data. Specifically, we propose a scalable strategy for addressing the research-data problem, ranging from the extraction of legacy data to its long-term storage. We review some existing solutions and propose novel avenues of research.

Daniel Lemire Andre Vellino 8 pages, Fourth Workshop on Very Large Digital Libraries, 2011 http://arxiv.org/abs/0909.1346v8 2011-02-22T11:00:56-05:00 2009-09-07T17:34:52-04:00 Reordering Columns for Smaller Indexes

Column-oriented indexes-such as projection or bitmap indexes-are compressed by run-length encoding to reduce storage and increase speed. Sorting the tables improves compression. On realistic data sets, permuting the columns in the right order before sorting can reduce the number of runs by a factor of two or more. Unfortunately, determining the best column order is NP-hard. For many cases, we prove that the number of runs in table columns is minimized if we sort columns by increasing cardinality. Experimentally, sorting based on Hilbert space-filling curves is poor at minimizing the number of runs.

Daniel Lemire Owen Kaser 10.1016/j.ins.2011.02.002 to appear in Information Sciences Daniel Lemire and Owen Kaser, Reordering Columns for Smaller Indexes, Information Sciences 181 (12), 2011 http://arxiv.org/abs/0811.3301v2 2009-06-10T13:37:32-04:00 2008-11-20T11:22:05-05:00 Faster Retrieval with a Two-Pass Dynamic-Time-Warping Lower Bound

Daniel Lemire 10.1016/j.patcog.2008.11.030 Accepted in Pattern Recognition on November 20th, 2008 Daniel Lemire, Faster Retrieval with a Two-Pass Dynamic-Time-Warping Lower Bound, Pattern Recognition 42(9): 2169-2180 (2009) http://arxiv.org/abs/0906.0910v1 2009-06-04T09:28:51-04:00 2009-06-04T09:28:51-04:00 On the Challenges of Collaborative Data Processing

The last 30 years have seen the creation of a variety of electronic collaboration tools for science and business. Some of the best-known collaboration tools support text editing (e.g., wikis). Wikipedia's success shows that large-scale collaboration can produce highly valuable content. Meanwhile much structured data is being collected and made publicly available. We have never had access to more powerful databases and statistical packages. Is large-scale collaborative data analysis now possible? Using a quantitative analysis of Web 2.0 data visualization sites, we find evidence that at least moderate open collaboration occurs. We then explore some of the limiting factors of collaboration over data.

Sylvie Noel Daniel Lemire to appear as a chapter in an upcoming book (Collaborative Information Behavior) http://arxiv.org/abs/0808.2083v3 2009-01-19T10:22:33-05:00 2008-08-14T23:14:55-04:00 Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Bitmap indexes must be compressed to reduce input/output costs and minimize CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. These techniques are sensitive to the order of the rows: a simple lexicographical sort can divide the index size by 9 and make indexes several times faster. We investigate reordering heuristics based on computed attribute-value histograms. Simply permuting the columns of the table based on these histograms can increase the sorting efficiency by 40%.

Owen Kaser Daniel Lemire Kamel Aouiche To appear in proceedings of DOLAP 2008 http://arxiv.org/abs/cs/0703058v3 2008-12-08T14:26:17-05:00 2007-03-13T13:46:11-04:00 A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

A data warehouse cannot materialize all possible views, hence we must estimate quickly, accurately, and reliably the size of views to determine the best candidates for materialization. Many available techniques for view-size estimation make particular statistical assumptions and their error can be large. Comparatively, unassuming probabilistic techniques are slower, but they estimate accurately and reliability very large view sizes using little memory. We compare five unassuming hashing-based view-size estimation techniques including Stochastic Probabilistic Counting and LogLog Probabilistic Counting. Our experiments show that only Generalized Counting, Gibbons-Tirthapura, and Adaptive Counting provide universally tight estimates irrespective of the size of the view; of those, only Adaptive Counting remains constantly fast as we increase the memory budget.

Kamel Aouiche Daniel Lemire http://arxiv.org/abs/cs/0703056v3 2008-12-08T14:23:44-05:00 2007-03-12T17:42:59-04:00 Unasssuming View-Size Estimation Techniques in OLAP

Even if storage was infinite, a data warehouse could not materialize all possible views due to the running time and update requirements. Therefore, it is necessary to estimate quickly, accurately, and reliably the size of views. Many available techniques make particular statistical assumptions and their error can be quite large. Unassuming techniques exist, but typically assume we have independent hashing for which there is no known practical implementation. We adapt an unassuming estimator due to Gibbons and Tirthapura: its theoretical bounds do not make unpractical assumptions. We compare this technique experimentally with stochastic probabilistic counting, LogLog probabilistic counting, and multifractal statistical models. Our experiments show that we can reliably and accurately (within 10%, 19 times out 20) estimate view sizes over large data sets (1.5 GB) within minutes, using almost no memory. However, only Gibbons-Tirthapura provides universally tight estimates irrespective of the size of the view. For large views, probabilistic counting has a small edge in accuracy, whereas the competitive sampling-based method (multifractal) we tested is an order of magnitude faster but can sometimes provide poor estimates (relative error of 100%). In our tests, LogLog probabilistic counting is not competitive. Experimental validation on the US Census 1990 data set and on the Transaction Processing Performance (TPC H) data set is provided.

Kamel Aouiche Daniel Lemire Published in ICEIS 2007 http://arxiv.org/abs/0807.1734v5 2008-10-06T20:45:02-04:00 2008-07-10T16:26:31-04:00 Faster Sequential Search with a Two-Pass Dynamic-Time-Warping Lower Bound

The Dynamic Time Warping (DTW) is a popular similarity measure between time series. The DTW fails to satisfy the triangle inequality and its computation requires quadratic time. Hence, to find closest neighbors quickly, we use bounding techniques. We can avoid most DTW computations with an inexpensive lower bound (LB_Keogh). We compare LB_Keogh with a tighter lower bound (LB_Improved). We find that LB_Improved-based search is faster for sequential search. As an example, our approach is 3 times faster over random-walk and shape time series. We also review some of the mathematical properties of the DTW. We derive a tight triangle inequality for the DTW. We show that the DTW becomes the l_1 distance when time series are separated by a constant.

Daniel Lemire http://arxiv.org/abs/cs/0702144v2 2008-09-15T15:57:44-04:00 2007-02-23T22:16:27-05:00 Slope One Predictors for Online Rating-Based Collaborative Filtering

Rating-based collaborative filtering is the process of predicting how a user would rate a given item from other user ratings. We propose three related slope one schemes with predictors of the form f(x) = x + b, which precompute the average difference between the ratings of one item and another for users who rated both. Slope one algorithms are easy to implement, efficient to query, reasonably accurate, and they support both online queries and dynamic updates, which makes them good candidates for real-world systems. The basic slope one scheme is suggested as a new reference scheme for collaborative filtering. By factoring in items that a user liked separately from items that a user disliked, we achieve results competitive with slower memory-based schemes over the standard benchmark EachMovie and Movielens data sets while better fulfilling the desiderata of CF applications.

Daniel Lemire Anna Maclachlan In SIAM Data Mining (SDM'05), Newport Beach, California, April 21-23, 2005 http://arxiv.org/abs/0805.3339v3 2008-08-14T20:08:42-04:00 2008-05-21T15:50:46-04:00 Tri de la table de faits et compression des index bitmaps avec alignement sur les mots

Bitmap indexes are frequently used to index multidimensional data. They rely mostly on sequential input/output. Bitmaps can be compressed to reduce input/output costs and minimize CPU usage. The most efficient compression techniques are based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. This type of compression accelerates logical operations (AND, OR) over the bitmaps. However, run-length encoding is sensitive to the order of the facts. Thus, we propose to sort the fact tables. We review lexicographic, Gray-code, and block-wise sorting. We found that a lexicographic sort improves compression--sometimes generating indexes twice as small--and make indexes several times faster. While sorting takes time, this is partially offset by the fact that it is faster to index a sorted table. Column order is significant: it is generally preferable to put the columns having more distinct values at the beginning. A block-wise sort is much less efficient than a full sort. Moreover, we found that Gray-code sorting is not better than lexicographic sorting when using word-aligned compression.

Kamel Aouiche Daniel Lemire Owen Kaser to appear at BDA'08 http://arxiv.org/abs/0805.0747v1 2008-05-06T11:45:15-04:00 2008-05-06T11:45:15-04:00 Pruning Attribute Values From Data Cubes with Diamond Dicing

Data stored in a data warehouse are inherently multidimensional, but most data-pruning techniques (such as iceberg and top-k queries) are unidimensional. However, analysts need to issue multidimensional queries. For example, an analyst may need to select not just the most profitable stores or--separately--the most profitable products, but simultaneous sets of stores and products fulfilling some profitability constraints. To fill this need, we propose a new operator, the diamond dice. Because of the interaction between dimensions, the computation of diamonds is challenging. We present the first diamond-dicing experiments on large data sets. Experiments show that we can compute diamond cubes over fact tables containing 100 million facts in less than 35 minutes using a standard PC.

Hazel Webb Owen Kaser Daniel Lemire http://arxiv.org/abs/0709.1166v1 2007-09-07T17:18:57-04:00 2007-09-07T17:18:57-04:00 An Optimal Linear Time Algorithm for Quasi-Monotonic Segmentation

Monotonicity is a simple yet significant qualitative characteristic. We consider the problem of segmenting a sequence in up to K segments. We want segments to be as monotonic as possible and to alternate signs. We propose a quality metric for this problem using the l_inf norm, and we present an optimal linear time algorithm based on novel formalism. Moreover, given a precomputation in time O(n log n) consisting of a labeling of all extrema, we compute any optimal segmentation in constant time. We compare experimentally its performance to two piecewise linear segmentation heuristics (top-down and bottom-up). We show that our algorithm is faster and more accurate. Applications include pattern recognition and qualitative modeling.

Daniel Lemire Martin Brooks Yuhong Yan 10.1080/00207160701694153 This is the extended version of our ICDM'05 paper (arXiv:cs/0702142) Daniel Lemire, Martin Brooks and Yuhong Yan, An Optimal Linear Time Algorithm for Quasi-Monotonic Segmentation. International Journal of Computer Mathematics 86 (7), 2009. http://arxiv.org/abs/cs/0703109v2 2007-05-07T13:59:20-04:00 2007-03-22T10:54:48-04:00 Tag-Cloud Drawing: Algorithms for Cloud Visualization

Tag clouds provide an aggregate of tag-usage statistics. They are typically sent as in-line HTML to browsers. However, display mechanisms suited for ordinary text are not ideal for tags, because font sizes may vary widely on a line. As well, the typical layout does not account for relationships that may be known between tags. This paper presents models and algorithms to improve the display of tag clouds that consist of in-line HTML, as well as algorithms that use nested tables to achieve a more general 2-dimensional layout in which tag relationships are considered. The first algorithms leverage prior work in typesetting and rectangle packing, whereas the second group of algorithms leverage prior work in Electronic Design Automation. Experiments show our algorithms can be efficiently implemented and perform well.

Owen Kaser Daniel Lemire To appear in proceedings of Tagging and Metadata for Social Information Organization (WWW 2007) http://arxiv.org/abs/cs/0605103v8 2007-04-28T19:13:34-04:00 2006-05-24T00:42:53-04:00 A Better Alternative to Piecewise Linear Time Series Segmentation

Time series are difficult to monitor, summarize and predict. Segmentation organizes time series into few intervals having uniform characteristics (flatness, linearity, modality, monotonicity and so on). For scalability, we require fast linear time algorithms. The popular piecewise linear model can determine where the data goes up or down and at what rate. Unfortunately, when the data does not follow a linear model, the computation of the local slope creates overfitting. We propose an adaptive time series model where the polynomial degree of each interval vary (constant, linear and so on). Given a number of regressors, the cost of each interval is its polynomial degree: constant intervals cost 1 regressor, linear intervals cost 2 regressors, and so on. Our goal is to minimize the Euclidean (l_2) error for a given model complexity. Experimentally, we investigate the model where intervals can be either constant or linear. Over synthetic random walks, historical stock market prices, and electrocardiograms, the adaptive model provides a more accurate segmentation than the piecewise linear model without increasing the cross-validation error or the running time, while providing a richer vocabulary to applications. Implementation issues, such as numerical stability and real-world performance, are discussed.

Daniel Lemire to appear in SIAM Data Mining 2007 http://arxiv.org/abs/cs/0610046v5 2007-03-21T21:19:11-04:00 2006-10-09T18:09:42-04:00 Streaming Maximum-Minimum Filter Using No More than Three Comparisons per Element

The running maximum-minimum (max-min) filter computes the maxima and minima over running windows of size w. This filter has numerous applications in signal processing and time series analysis. We present an easy-to-implement online algorithm requiring no more than 3 comparisons per element, in the worst case. Comparatively, no algorithm is known to compute the running maximum (or minimum) filter in 1.5 comparisons per element, in the worst case. Our algorithm has reduced latency and memory usage.

Daniel Lemire to appear in Nordic Journal of Computing Daniel Lemire, Streaming Maximum-Minimum Filter Using No More than Three Comparisons per Element, Nordic Journal of Computing, Volume 13, Number 4, pages 328-339, 2006 http://arxiv.org/abs/cs/0702143v1 2007-02-23T22:11:09-05:00 2007-02-23T22:11:09-05:00 Attribute Value Reordering For Efficient Hybrid OLAP

The normalization of a data cube is the ordering of the attribute values. For large multidimensional arrays where dense and sparse chunks are stored differently, proper normalization can lead to improved storage efficiency. We show that it is NP-hard to compute an optimal normalization even for 1x3 chunks, although we find an exact algorithm for 1x2 chunks. When dimensions are nearly statistically independent, we show that dimension-wise attribute frequency sorting is an optimal normalization and takes time O(d n log(n)) for data cubes of size n^d. When dimensions are not independent, we propose and evaluate several heuristics. The hybrid OLAP (HOLAP) storage mechanism is already 19%-30% more efficient than ROLAP, but normalization can improve it further by 9%-13% for a total gain of 29%-44% over ROLAP.

Owen Kaser Daniel Lemire 10.1016/j.ins.2005.09.005 Owen Kaser, Daniel Lemire, Attribute Value Reordering For Efficient Hybrid OLAP, Information Sciences, Volume 176, Issue 16, 2006, Pages 2304-2336 http://arxiv.org/abs/cs/0702142v1 2007-02-23T21:29:36-05:00 2007-02-23T21:29:36-05:00 An Optimal Linear Time Algorithm for Quasi-Monotonic Segmentation

Monotonicity is a simple yet significant qualitative characteristic. We consider the problem of segmenting an array in up to K segments. We want segments to be as monotonic as possible and to alternate signs. We propose a quality metric for this problem, present an optimal linear time algorithm based on novel formalism, and compare experimentally its performance to a linear time top-down regression algorithm. We show that our algorithm is faster and more accurate. Applications include pattern recognition and qualitative modeling.

Daniel Lemire Martin Brooks Yuhong Yan Appeared in ICDM 2005 http://arxiv.org/abs/math/0701481v2 2007-01-24T16:01:45-05:00 2007-01-17T10:01:09-05:00 Monotonicity Analysis over Chains and Curves

Chains are vector-valued signals sampling a curve. They are important to motion signal processing and to many scientific applications including location sensors. We propose a novel measure of smoothness for chains curves by generalizing the scalar-valued concept of monotonicity. Monotonicity can be defined by the connectedness of the inverse image of balls. This definition is coordinate-invariant and can be computed efficiently over chains. Monotone curves can be discontinuous, but continuous monotone curves are differentiable a.e. Over chains, a simple sphere-preserving filter shown to never decrease the degree of monotonicity. It outperforms moving average filters over a synthetic data set. Applications include Time Series Segmentation, chain reconstruction from unordered data points, Optical Character Recognition, and Pattern Matching.

Dan Kucerovsky Daniel Lemire to appear in Proceedings of Curves and Surfaces 2006 http://arxiv.org/abs/cs/0605127v1 2006-05-26T20:51:46-04:00 2006-05-26T20:51:46-04:00 Analyzing Large Collections of Electronic Text Using OLAP

Computer-assisted reading and analysis of text has various applications in the humanities and social sciences. The increasing size of many electronic text archives has the advantage of a more complete analysis but the disadvantage of taking longer to obtain results. On-Line Analytical Processing is a method used to store and quickly analyze multidimensional data. By storing text analysis information in an OLAP system, a user can obtain solutions to inquiries in a matter of seconds as opposed to minutes, hours, or even days. This analysis is user-driven allowing various users the freedom to pursue their own direction of research.

Steven Keith Owen Kaser Daniel Lemire