Data – Radar

Signals for 2026

Julie Baron — Fri, 09 Jan 2026 12:14:20 +0000

We’re three years into a post-ChatGPT world, and AI remains the focal point of the tech industry. In 2025, several ongoing trends intensified: AI investment accelerated; enterprises integrated agents and workflow automation at a faster pace; and the toolscape for professionals seeking a career edge is now overwhelmingly expansive. But the jury’s still out on the ROI from the vast sums that have saturated the industry.

We anticipate that 2026 will be a year of increased accountability. Expect enterprises to shift focus from experimentation to measurable business outcomes and sustainable AI costs. There are promising productivity and efficiency gains to be had in software engineering and development, operations, security, and product design, but significant challenges also persist.

Bigger picture, the industry is still grappling with what AI is and where we’re headed. Is AI a worker that will take all our jobs? Is AGI imminent? Is the bubble about to burst? Economic uncertainty, layoffs, and shifting AI hiring expectations have undeniably created stark career anxiety throughout the industry. But as Tim O’Reilly pointedly argues, “AI is not taking jobs: The decisions of people deploying it are.” No one has quite figured out how to make money yet, but the organizations that succeed will do so by creating solutions that “genuinely improve. . .customers’ lives.” That won’t happen by shoehorning AI into existing workflows but by first determining where AI can actually improve upon them, then taking an “AI first” approach to developing products around these insights.

As Tim O’Reilly and Mike Loukides recently explained, “At O’Reilly, we don’t believe in predicting the future. But we do believe you can see signs of the future in the present.” We’re watching a number of “possible futures taking shape.” AI will undoubtedly be integrated more deeply into industries, products, and the wider workforce in 2026 as use cases continue to be discovered and shared. Topics we’re keeping tabs on include context engineering for building more reliable, performant AI systems; LLM posttraining techniques, in particular fine-tuning as a means to build more specialized, domain-specific models; the growth of agents, as well as the protocols, like MCP, to support them; and computer vision and multimodal AI more generally to enable the development of physical/embodied AI and the creation of world models.

Here are some of the other trends that are pointing the way forward.

Software Development

In 2025, AI was embedded in software developers’ everyday work, transforming their roles—in some cases dramatically. A multitude of AI tools are now available to create code, and workflows are undergoing a transformation shaped by new concepts including vibe coding, agentic development, context engineering, eval- and spec-driven development, and more.

In 2026, we’ll see an increased focus on agents and the protocols, like MCP, that support them; new coding workflows; and AI’s role in helping with legacy code. But even as software development practices evolve, fundamental skills such as code review, design patterns, debugging, testing, and documentation are as vital as ever.

And despite major disruption from GenAI, programming languages aren’t going anywhere. Type-safe languages like TypeScript, Java, and C# provide compile-time validation that can catch many AI-introduced errors before production, helping mitigate the risks of AI-generated code. Memory safety mandates will drive interest in Rust for systems programming, and Zig is drawing attention as well: Major players such as Google, Microsoft, Amazon, and Meta have adopted Rust for critical systems, and Zig underpins Bun, Anthropic’s most recent acquisition. And Python is central to creating powerful AI and machine learning frameworks, driving complex intelligent automation that extends far beyond simple scripting. It’s also ideal for edge computing and robotics, two areas where AI is likely to make inroads in the coming year.

Takeaways

Which AI tools programmers use matters less than how they use them. With a wide choice of tools now available in the IDE and on the command line, and new options being introduced all the time, it’s useful to focus on the skills needed to produce good code rather than on the tools themselves. After all, whatever tool they use, developers are ultimately responsible for the code it produces.

Effectively communicating with AI models is the key to doing good work. The more background AI tools are given about a project, the better the code they generate will be. Developers have to understand both how to manage what the AI knows about their project (context engineering) and how to communicate it (prompt engineering) to get useful outputs.

AI isn’t just a pair programmer; it’s an entire team of developers. Software engineers have moved beyond single coding assistants. They’re building and deploying custom agents, often within complex setups involving multi-agent scenarios, teams of coding agents, and agent swarms. But as the engineering workflow shifts from conducting AI to orchestrating AI, the fundamentals of building and maintaining good software—code review, design patterns, debugging, testing, and documentation—stay the same and will be what elevates purposeful AI-assisted code above the crowd.

Software Architecture

AI has progressed from being something architects might have to consider to something that is now essential to their work. They can use LLMs to accelerate or optimize architecture tasks; they can add AI to existing software systems or use it to modernize those systems; and they can design AI-native architectures, an approach that requires new considerations and patterns for system design. And even if they aren’t working with AI (yet), architects still need to understand how AI relates to other parts of their system and be able to communicate their decisions to stakeholders at all levels.

Takeaways

AI-enhanced and AI-native architectures bring new considerations and patterns for system design. Event-driven models can enable AI agents to act on incoming triggers rather than fixed prompts. In 2026, evolving architectures will become more important as architects look for ways to modernize existing systems for AI. And the rise of agentic AI means architects need to stay up-to-date on emerging protocols like MCP.

Many of the concerns from 2025 will carry over into the new year. Considerations such as incorporating LLMs and RAG into existing architectures, emerging architecture patterns and antipatterns specifically for AI systems, and the focus on API and data integrations elevated by MCP are critical.

The fundamentals still matter. Tools and frameworks are making it possible to automate more tasks. However, to successfully leverage these capabilities to design sustainable architecture, enterprise architects must have a full command of the principles behind them: when to add an agent or a microservice, how to consider cost, how to define boundaries, and how to act on the knowledge they already have.

Infrastructure and Operations

The InfraOps space is undergoing its most significant transformation since cloud computing, as AI evolves from a workload to be managed to an active participant in managing infrastructure itself. With infrastructure sprawling across multicloud environments, edge deployments, and specialized AI accelerators, manual management is becoming nearly impossible. In 2026, the industry will keep moving toward self-healing systems and predictive observability—infrastructure that continuously optimizes itself, shifting the human role from manual maintenance to system oversight, architecture, and long-term strategy.

Platform engineering makes this transformation operational, abstracting infrastructure complexity behind self-service interfaces, which lets developers deploy AI workloads, implement observability, and maintain security without deep infrastructure expertise. The best platforms will evolve into orchestration layers for autonomous systems. While fully autonomous systems remain on the horizon, the trajectory is clear.

Takeaways

AI is becoming a primary driver of infrastructure architecture. AI-native workloads demand GPU orchestration at scale, specialized networking protocols optimized for model training and inference, and frameworks like Ray on Kubernetes that can distribute compute intelligently. Organizations are redesigning infrastructure stacks to accommodate these demands and are increasingly considering hybrid environments and alternatives to hyperscalers to power their AI workloads—“neocloud” platforms like CoreWeave, Lambda, and Vultr.

AI is augmenting the work of operations teams with real-time intelligence. Organizations are turning to AIOps platforms to predict failures before they cascade, identify anomalies humans would miss, and surface optimization opportunities in telemetry data. These systems aim to amplify human judgment, giving operators superhuman pattern recognition across complex environments.

AI is evolving into an autonomous operator that makes its own infrastructure decisions. Companies will implement emerging “agentic SRE” practices: systems that reason about infrastructure problems, form hypotheses about root causes, and take independent corrective action, replicating the cognitive workload that SREs perform, not just following predetermined scripts.

Data

The big story of the back half of 2025 was agents. While the groundwork has been laid, in 2026 we expect focus on the development of agentic systems to persist—and this will necessitate new tools and techniques, particularly on the data side. AI and data platforms continue to converge, with vendors like Snowflake, Databricks, and Salesforce releasing products to help customers build and deploy agents.

Beyond agents, AI is making its influence felt across the entire data stack, as data professionals target their workflows to support enterprise AI. Significant trends include real-time analytics, enhanced data privacy and security, and the increasing use of low-code/no-code tools to democratize data access. Sustainability also remains a concern, and data professionals need to consider ESG compliance, carbon-aware tooling, and resource-optimized architectures when designing for AI workloads.

Takeaways

Data infrastructure continues to consolidate. The consolidation trend has not only affected the modern data stack but also more traditional areas like the database space. In response, organizations are being more intentional about what kind of databases they deploy. At the same time, modern data stacks have fragmented across cloud platforms and open ecosystems, so engineers must increasingly design for interoperability.

A multiple database approach is more important than ever. Vector databases like Pinecone, Milvus, Qdrant, and Weaviate help power agentic AI—while they’re a new technology, companies are beginning to adopt vector databases more widely. DuckDB’s popularity is growing for running analytical queries. And even though it’s been around for a while, ClickHouse, an open source distributed OLAP database used for real-time analytics, has finally broken through with data professionals.

The infrastructure to support autonomous agents is coming together. GitOps, observability, identity management, and zero-trust orchestration will all play key roles. And we’re following a number of new initiatives that facilitate agentic development, including AgentDB, a database designed specifically to work effectively with AI agents; Databricks’ recently announced Lakebase, a Postgres database/OLTP engine integrated within the data lakehouse; and Tiger Data’s Agentic Postgres, a database “designed from the ground up” to support agents.

Security

AI is a threat multiplier—59% of tech professionals cited AI-driven cyberthreats as their biggest concern in a recent survey. In response, the cybersecurity analyst role is shifting from low-level human-in-the-loop tasks to complex threat hunting, AI governance, advanced data analysis and coding, and human-AI teaming oversight. But addressing AI-generated threats will also require a fundamental transformation in defensive strategy and skill acquisition—and the sooner it happens, the better.

Takeaways

Security professionals now have to defend a broader attack surface. The proliferation of AI agents expands that surface, and security tools must evolve to protect it. Implementing zero trust for machine identities is a smart opening move to mitigate sprawl and nonhuman traffic. Security professionals must also harden their AI systems against common threats such as prompt injection and model manipulation.

Organizations are struggling with governance and compliance. Striking a balance between data utility and vulnerability requires adherence to data governance best practices (e.g., least privilege). Government agencies, industry and professional groups, and technology companies are developing a range of AI governance frameworks to help guide organizations, but it’s up to companies to translate these technical governance frameworks into board-level risk decisions and actionable policy controls.

The security operations center (SOC) is evolving. The velocity and scale of AI-driven attacks can overwhelm traditional SIEM/SOAR solutions. Expect increased adoption of agentic SOC—a system of specialized, coordinated AI agents for triage and response. This shifts the focus of the SOC analyst from reactive alert triage to proactive threat hunting, complex analysis, and AI system oversight.

Product Management and Design

Business focus in 2025 shifted from scattered AI experiments to the challenge of building defensible, AI-native businesses. In 2026, we’re likely to see product teams moving from proof of concept to proof of value.

One thing to look for: Design and product responsibilities may consolidate under a “product builder”—a full stack generalist in product, design, and engineering who can rapidly build, validate, and launch new products. Companies are currently hiring for this role, although few people actually possess the full skill set at the moment. But regardless of whether product builders become ascendant, product folks in 2026 and beyond will need the ability to combine product validation, good-enough engineering, and rapid design, all enabled by AI as a core accelerator. We’re already seeing the “product manager” role becoming more technical as AI spreads throughout the product development process. Nearly all PMs use AI, but they’ll increasingly employ purpose-built AI workflows for research, user-testing, data analysis, and prototyping.

Takeaways

Companies need to bridge the AI product strategy gap. Most companies have moved past simple AI experiments but are now facing a strategic crisis. Their existing product playbooks (market sizing, roadmapping, UX) weren’t designed for AI-native products. Organizations must develop clear frameworks for building a portfolio of differentiated AI products, managing new risks, and creating sustainable value.

AI product evaluation is now mission-critical. As AI becomes a core product component and strategy matures, rigorous evaluation is the key to turning products that are good on paper into those that are great in production. Teams should start by defining what “good” means for their specific context, then build reliable evals for models, agents, and conversational UIs to ensure they’re hitting that target.

Design’s new frontier is conversations and interactions. Generative AI has pushed user experience beyond static screens into new probabilistic, multimodal territory. This means a harder shift toward designing nonlinear, conversational systems, including AI agents. In 2026, we’re likely to see increased demand for AI conversational designers and AI interaction designers to devise conversation flows for chatbots and even design a model’s behavior and personality.

What It All Means

While big questions about AI remain unanswered, the best way to plan for uncertainty is to consider the real value you can create for your users and for your teams themselves right now. The tools will improve, as they always do, and the strategies to use them will grow more complex. Being deeply versed in the core knowledge of your area of expertise gives you the foundation you’ll need to take advantage of these quickly evolving technologies—and ensure that whatever you create will be built on bedrock, not shaky ground.

AI, MCP, and the Hidden Costs of Data Hoarding

Andrew Stellman — Mon, 15 Dec 2025 13:15:24 +0000

The Model Context Protocol (MCP) is genuinely useful. It gives people who develop AI tools a standardized way to call functions and access data from external systems. Instead of building custom integrations for each data source, you can expose databases, APIs, and internal tools through a common protocol that any AI can understand.

However, I’ve been watching teams adopt MCP over the past year, and I’m seeing a disturbing pattern. Developers are using MCP to quickly connect their AI assistants to every data source they can find—customer databases, support tickets, internal APIs, document stores—and dumping it all into the AI’s context. And because the AI is smart enough to sort through a massive blob of data and pick out the parts that are relevant, it all just works! Which, counterintuitively, is actually a problem. The AI cheerfully processes massive amounts of data and produces reasonable answers, so nobody even thinks to question the approach.

This is data hoarding. And just as physical hoarders can’t throw anything away until their homes become so cluttered they’re unlivable, teams that hoard data can create serious problems for themselves. Developers learn they can fetch far more data than the AI needs and provide it with little planning or structure, and the AI is smart enough to deal with it and still give good results.

When connecting a new data source takes hours instead of days, many developers don’t take the time to ask what data actually belongs in the context. That’s how you end up with systems that are expensive to run and impossible to debug, while an entire cohort of developers misses the chance to learn the critical data architecture skills they need to build robust and maintainable applications.

How Teams Learn to Hoard

Anthropic released MCP in late 2024 to give developers a universal way to connect AI assistants to their data. Instead of maintaining separate code for connectors to let AI access data from, say, S3, OneDrive, Jira, ServiceNow, and your internal DBs and APIs, you use the same simple protocol to provide the AI with all sorts of data to include in its context. It quickly gained traction. Companies like Block and Apollo adopted it, and teams everywhere started using it. The promise is real; in many cases, the work of connecting data sources to AI agents that used to take weeks can now take minutes. But that speed can come at a cost.

Let’s start with an example: a small team working on an AI tool that reads customer support tickets, categorizes them by urgency, suggests responses, and routes them to the right department. They needed to get something working quickly but faced a challenge: They had customer data spread across multiple systems. After spending a morning arguing about what data to pull, which fields were necessary, and how to structure the integration, one developer decided to just build it, creating a single getCustomerData(customerId) MCP tool that pulls everything they’d discussed—40 fields from three different systems—into one big response object. To the team’s relief, it worked! The AI happily consumed all 40 fields and started answering questions, and no more discussions or decisions were needed. The AI handled all the new data just fine, and everyone felt like the project was on the right track.

Day two, someone added order history so the assistant could explain refunds. Soon the tool pulled Zendesk status, CRM status, eligibility flags that contradicted each other, three different name fields, four timestamps for “last seen,” plus entire conversation threads, and combined them all into an ever-growing data object.

The assistant kept producing reasonable-looking answers, even as the data it ingested kept growing in scale. However, the model now had to wade through thousands of irrelevant tokens before answering simple questions like “Is this customer eligible for a refund?” The team ended up with a data architecture that buried the signal in noise. The additional load forced the AI to work harder to dig out that signal, setting the team up for potentially serious long-term problems. But because the answers still looked reasonable, they didn’t realize it yet. As they added more data sources over the following weeks, the AI started taking longer to respond. Hallucinations crept in that they couldn’t trace to any specific data source. What had been a really valuable tool became a bear to maintain.

The team had fallen into the data hoarding trap: Their early quick wins created a culture where people just threw whatever they needed into the context, and eventually it grew into a maintenance nightmare that only got worse as they added more data sources.

The Skills That Never Develop

There are as many opinions on data architecture as there are developers, and there are usually many ways to solve any one problem. One thing that almost everyone agrees on is that it takes careful choices and lots of experience. But it’s also the subject of lots of debate, especially within teams, precisely because there are so many ways to design how your application stores, transmits, encodes, and uses data.

Most of us fall into just-in-case thinking at one time or another, especially early in our careers—pulling all the data we might possibly need just in case we need it rather than fetching only what we need when we actually need it (which is an example of the opposite, just-in-time thinking). Normally when we’re designing our data architecture, we’re dealing with immediate constraints: ease of access, size, indexing, performance, network latency, and memory usage. But when we use MCP to provide data to an AI, we can often sidestep many of those trade-offs…temporarily.

The more we work with data, the better we get at designing how our apps use it. The more early-career developers are exposed to it, the more they learn through experience why, for example, System A should own customer status while System B owns payment history. Healthy debate is an important part of this learning process. Through all of these experiences, we develop an intuition for what “too much data” looks like—and how to handle all of those tricky but critical trade-offs that create friction throughout our projects.

MCP can remove the friction that comes from those trade-offs by letting us avoid having to make those decisions at all. If a developer can wire up everything in just a few minutes, there’s no need for discussion or debate about what’s actually needed. The AI seems to handle whatever data you throw at it, so the code ships without anyone questioning the design.

Without all of that experience making, discussing, and debating data design choices, developers miss the chance to build critical mental models about data ownership, system boundaries, and the cost of moving unnecessary data around. They spend their formative years connecting instead of architecting. This is another example of what I call the cognitive shortcut paradox: AI tools that make development easier can prevent developers from building the very skills they need to use those tools effectively. Developers who rely solely on MCP to handle messy data never learn to recognize when data architecture is problematic, just like developers who rely solely on tools like Copilot or Claude Code to generate code never learn to debug what those tools create.

The Hidden Costs of Data Hoarding

Teams use MCP because it works. Many teams carefully plan their MCP data architecture, and even teams that do fall into the data hoarding trap still ship successful products. But MCP is still relatively new, and the hidden costs of data hoarding take time to surface.

Teams often don’t discover the problems with a data hoarding approach until they need to scale their applications. That bloated context that barely registered as a cost for your first hundred queries starts showing up as a real line item in your cloud bill when you’re handling millions of requests. Every unnecessary field you’re passing to the AI adds up, and you’re paying for all that redundant data on every single AI call.
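A back-of-the-envelope calculation shows how this plays out. The sketch below is illustrative only: the per-token price, the request volume, and the token counts are all assumed placeholders, not real pricing or measurements, so substitute your own numbers.

```python
# Back-of-the-envelope cost of hoarded context.
# All figures below are hypothetical placeholders, not real pricing or measurements.

def monthly_input_cost(tokens_per_call: int, calls_per_month: int,
                       usd_per_million_tokens: float) -> float:
    """Cost of the input tokens sent to the model over one month."""
    return tokens_per_call * calls_per_month * usd_per_million_tokens / 1_000_000


ASSUMED_PRICE = 3.00        # USD per million input tokens (assumption)
ASSUMED_CALLS = 2_000_000   # requests per month (assumption)

# An everything-in-one-blob context versus only the fields the task needs.
hoarded = monthly_input_cost(5_000, ASSUMED_CALLS, ASSUMED_PRICE)
lean = monthly_input_cost(200, ASSUMED_CALLS, ASSUMED_PRICE)

print(f"hoarded context: ${hoarded:,.2f}/month")
print(f"lean context:    ${lean:,.2f}/month")
print(f"spent on unused data: ${hoarded - lean:,.2f}/month")
```

With these assumed figures, the hoarded context costs $30,000 a month against $1,200 for the lean one. The absolute numbers matter less than the ratio, which stays constant as traffic grows.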

Any developer who’s dealt with tightly coupled classes knows that when something goes wrong—and it always does, eventually—it’s a lot harder to debug. You often end up dealing with shotgun surgery, that really unpleasant situation where fixing one small problem requires changes that cascade across multiple parts of your codebase. Hoarded data creates the same kind of technical debt in your AI systems: When the AI gives a wrong answer, tracking down which field it used or why it trusted one system over another is difficult, often impossible.

There’s also a security dimension to data hoarding that teams often miss. Every piece of data you expose through an MCP tool is a potential vulnerability. If an attacker finds an unprotected endpoint, they can pull everything that tool provides. If you’re hoarding data, that’s your entire customer database instead of just the three fields actually needed for the task. Teams that fall into the data hoarding trap find themselves violating the principle of least privilege: Applications should have access to the data they need, but no more. That can bring an enormous security risk to their whole organization.
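One way to apply least privilege at the tool boundary is an explicit allowlist of fields per task. The following is a minimal sketch, not a prescribed implementation: the record layout, the field names, and the get_refund_context handler are hypothetical, and it uses plain Python functions rather than any particular MCP SDK.

```python
# Minimal sketch of least privilege at the tool boundary.
# The record layout, field names, and tool name are hypothetical.
from typing import Any

# The only fields this task is allowed to expose (assumption: three are enough).
REFUND_FIELDS = ("customer_id", "refund_eligible", "last_order_date")


def project(record: dict[str, Any], allowed: tuple[str, ...]) -> dict[str, Any]:
    """Return only the allowlisted fields; fail loudly if one is missing."""
    missing = [name for name in allowed if name not in record]
    if missing:
        raise KeyError(f"record is missing required fields: {missing}")
    return {name: record[name] for name in allowed}


def get_refund_context(record: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical tool handler: what the model sees for a refund question."""
    return project(record, REFUND_FIELDS)


# A stand-in for the full customer record pulled from backend systems.
full_record = {
    "customer_id": "c-123",
    "refund_eligible": True,
    "last_order_date": "2025-11-30",
    "email": "pat@example.com",
    "home_address": "1 Example Street",
    "conversation_threads": ["...long support history..."],
}

exposed = get_refund_context(full_record)
print(exposed)
```

Because the allowlist is a named constant, adding a field becomes a visible change that someone can question in review, and a compromised tool can leak only what is on the list.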

In an extreme case of data hoarding infecting an entire company, you might discover that every team in your organization is building their own blob. Support has one version of customer data, sales has another, product has a third. The same customer looks completely different depending on which AI assistant you ask. New teams come along, see what appears to be working, and copy the pattern. Now you’ve got data hoarding as organizational culture.

Each team thought they were being pragmatic, shipping fast, and avoiding unnecessary arguments about data architecture. But the hoarding pattern spreads through an organization the same way technical debt spreads through a codebase. It starts small and manageable. Before you know it, it’s everywhere.

Practical Tools for Avoiding the Data Hoarding Trap

It can be really difficult to coach a team away from data hoarding when they’ve never experienced the problems it causes. Developers are very practical—they want to see evidence of problems and aren’t going to sit through abstract discussions about data ownership and system boundaries when everything they’ve done so far has worked just fine.

In Learning Agile, Jennifer Greene and I wrote about how teams resist change because they know that what they’re doing today works. To the person trying to get developers to change, it may seem like irrational resistance, but it’s actually pretty rational to push back against someone from the outside telling them to throw out what works today for something unproven. But just like developers eventually learn that taking time for refactoring speeds them up in the long run, teams need to learn the same lesson about deliberate data design in their MCP tools.

Here are some practices that can make those discussions easier, by starting with constraints that even skeptical developers can see the value in:

Build tools around verbs, not nouns. Create checkEligibility() or getRecentTickets() instead of getCustomer(). Verbs force you to think about specific actions and naturally limit scope.
Talk about minimizing data needs. Before anyone creates an MCP tool, discuss the smallest piece of data the AI needs to do its job and what experiments the team can run to figure out what the AI truly needs.
Break reads apart from reasoning. Separate data fetching from decision-making when you design your MCP tools. A simple findCustomerId() tool that returns just an ID uses minimal tokens—and might not even need to be an MCP tool at all, if a simple API call will do. Then getCustomerDetailsForRefund(id) pulls only the specific fields needed for that decision. This pattern keeps context focused and makes it obvious when someone’s trying to fetch everything.
Dashboard the waste. The best argument against data hoarding is showing the waste. Track the ratio of tokens fetched to tokens used and display it in an “information radiator” style dashboard that everyone can see. When a tool pulls 5,000 tokens but the AI only references 200 in its answer, everyone can see the problem. Once developers see they’re paying for tokens they never use, they get very interested in fixing it.
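The numbers behind such a dashboard can be simple. The sketch below aggregates fetched and used tokens per tool and ranks tools by the share of tokens that went unused. It assumes something upstream already records both counts for each call. How "used" is measured is left open here, and the tool names and counts are made-up sample data.

```python
# Sketch of the numbers behind a "tokens fetched vs. tokens used" dashboard.
# How tokens_used is measured is left open (assumption: some attribution step
# supplies it); the tool names and counts are made-up sample data.
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    tokens_fetched: int   # tokens the tool put into the context
    tokens_used: int      # tokens the answer actually drew on


def waste_report(calls: list[ToolCall]) -> list[tuple[str, int, int, float]]:
    """Aggregate per tool: (tool, fetched, used, share of fetched tokens unused)."""
    fetched: dict[str, int] = defaultdict(int)
    used: dict[str, int] = defaultdict(int)
    for call in calls:
        fetched[call.tool] += call.tokens_fetched
        used[call.tool] += call.tokens_used
    rows = []
    for tool, total in fetched.items():
        unused_share = 1 - used[tool] / total if total else 0.0
        rows.append((tool, total, used[tool], unused_share))
    # Worst offenders first, so the dashboard leads with them.
    return sorted(rows, key=lambda row: row[3], reverse=True)


calls = [
    ToolCall("getCustomerData", 5_000, 200),
    ToolCall("getCustomerData", 5_000, 300),
    ToolCall("checkEligibility", 120, 90),
]

report = waste_report(calls)
for tool, total, used_tokens, unused_share in report:
    print(f"{tool:<18} fetched={total:>6} used={used_tokens:>5} unused={unused_share:.0%}")
```

In this sample, the noun-shaped tool leaves 95% of its tokens unused while the verb-shaped one leaves 25%, which is the kind of contrast that tends to start the right conversation.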

Quick smell test for data hoarding

Tool names are nouns (getCustomer()) instead of verbs (checkEligibility()).
Nobody’s ever asked, “Do we really need all these fields?”
You can’t tell which system owns which piece of data.
Debugging requires detective work across multiple data sources.
Your team rarely or never discusses the data design of MCP tools before building them.
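
To make the first item on that list concrete, here is a rough sketch of verb-shaped tools that keep reads apart from reasoning. Plain Python functions stand in for MCP tools (wiring them into a server is omitted), the snake_case names mirror the camelCase names used above, and the sample data and the 30-day refund window are hypothetical.

```python
# Sketch: verb-shaped tools that keep reads apart from reasoning.
# Plain functions stand in for MCP tools; the data, field names, and the
# 30-day refund rule are hypothetical.
from datetime import date
from typing import Optional

# Stand-ins for the backend systems.
CUSTOMERS_BY_EMAIL = {"pat@example.com": "c-123"}
ORDERS = {"c-123": {"last_order_date": date(2025, 11, 30), "already_refunded": False}}


def find_customer_id(email: str) -> Optional[str]:
    """Read: return just the ID, nothing else."""
    return CUSTOMERS_BY_EMAIL.get(email)


def get_customer_details_for_refund(customer_id: str) -> dict:
    """Read: only the fields a refund decision needs."""
    order = ORDERS[customer_id]
    return {
        "last_order_date": order["last_order_date"],
        "already_refunded": order["already_refunded"],
    }


def check_eligibility(details: dict, today: date, window_days: int = 30) -> bool:
    """Reasoning: a pure function over the small record fetched above."""
    age_days = (today - details["last_order_date"]).days
    return not details["already_refunded"] and 0 <= age_days <= window_days


customer_id = find_customer_id("pat@example.com")
details = get_customer_details_for_refund(customer_id)
eligible = check_eligibility(details, today=date(2025, 12, 15))
print(customer_id, eligible)
```

Whether the final decision is made by a function like check_eligibility or by the model is a design choice. Either way, the decision step receives two fields instead of a 40-field blob.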

Looking Forward

MCP is a simple but powerful tool with enormous potential for teams. But because it can be a critically important pillar of your entire application architecture, problems you introduce at the MCP level ripple throughout your project. Small mistakes have huge consequences down the road.

The very simplicity of MCP encourages data hoarding. It’s an easy trap to fall into, even for experienced developers. But what worries me most is that developers learning with these tools right now might never learn why data hoarding is a problem, and they won’t develop the architectural judgment that comes from having to make hard choices about data boundaries. Our job, especially as leaders and senior engineers, is to help everyone avoid the data hoarding trap.

When you treat MCP decisions with the same care you give any core interface—keeping context lean, setting boundaries, revisiting them as you learn—MCP stays what it should be: a simple, reliable bridge between your AI and the systems that power it.

Scale Tiny Projects into a Resilient Data Culture

Kord Davis — Fri, 07 Nov 2025 12:18:48 +0000

In today’s fast-paced business environment, the ultimate goal of any data effort is to enable better decisions and drive meaningful organizational outcomes. Too often, data initiatives fail because they treat data or “data culture” as the final product. However, the journey to a data-driven organization doesn’t have to start with massive, complex initiatives. Instead, leaders can strategically select and implement “tiny projects” that serve as stepping stones toward improving results. These small wins, rooted in principles of human-centered design, create momentum, secure buy-in for larger initiatives, and attract more collaborators along the way by focusing on tangible results, not just data collection.

Identifying and Scoping Tiny Projects: Starting with Empathy

The first step in this journey is to identify potential tiny projects that align with your organization’s goals. Crucially, this stage is driven by empathy, the foundational principle of human-centered design (HCD), which means putting the needs and experiences of the people—the users—at the center of the solution.

These projects should be manageable in scope but impactful enough to demonstrate value.

Here are some tips for selecting the right projects:

Focus on pain points (the empathy phase)

Look for areas within your organization where data could alleviate existing challenges. For example, a marketing team might struggle to analyze customer feedback effectively. A tiny project could involve using data analytics to identify key themes in customer sentiment from recent campaigns. This user-driven starting point ensures the solution is relevant and immediately valued.

Leverage existing resources

Consider projects that utilize tools and data already available within your organization. This approach minimizes costs and reduces the time needed for implementation. For instance, a sales team could analyze historical sales data to identify trends and improve forecasting. A great example of this is a project where a team of three—a data analyst, a policy advisor, and a communications staff member—identified over $4M in savings for a major American city. They simply used existing, albeit “dirty,” data to find cost reductions in postal charges.

Set clear objectives

Define specific, measurable goals for each tiny project. This clarity will help teams understand what success looks like and keep them focused. For example, if the goal is to reduce customer churn, aim for a specific percentage reduction within a set time frame.

Showcasing Wins to Build Momentum: Testing and Iteration

Once you’ve identified and scoped your tiny projects, the next step is to execute them effectively and showcase the wins. Celebrating small successes is crucial for building momentum and gaining support for future initiatives. In HCD terms, these tiny projects are rapid prototypes designed for quick testing and feedback.

Here’s how to do it:

Communicate results

Share the outcomes of your tiny projects with the broader organization. Use visual aids like dashboards or infographics to present data in an engaging way. Highlight not just the quantitative results, but also the qualitative benefits, such as improved team collaboration or enhanced customer satisfaction.

Gather testimonials (validating the prototype)

Encourage team members involved in the projects to share their experiences. Personal stories about how data-driven decisions made a difference can resonate more deeply than numbers alone. These testimonials provide qualitative feedback to validate the solution’s impact, illustrating the value of a data culture to skeptics. A powerful example of this is a team of four from a major metro area—including an HR person for the police department, a data analyst, a program manager, and a police officer—who, in less than two days, identified several constraints in their police department’s diversity hiring practices. Using only a small dataset, Post-it notes, and pens, they leveraged their collective knowledge and experience. Their results were shared with law enforcement leadership and led to direct policy and communication changes.

Create a feedback loop (continuous improvement)

After completing a tiny project, gather feedback from participants and stakeholders. This input can help refine future projects and demonstrates a commitment to continuous improvement, which is central to the iterative nature of HCD. It also fosters a sense of ownership among team members, encouraging them to engage in future initiatives.

Securing Buy-In for Larger Initiatives: Scaling the Design

As you build momentum with tiny projects, you’ll find it easier to secure buy-in for larger data initiatives. The successful prototypes created through the small projects provide the evidence needed to support scaling.

Here are some strategies to help you gain support:

Align with organizational goals

When proposing larger projects, ensure they align with the broader objectives of the organization. Demonstrating how these initiatives can drive strategic goals will make it easier to gain leadership support.

Showcase scalability

Use the successes of tiny projects to illustrate how larger initiatives can build on these foundations. For example, if a small project successfully improved customer insights, propose a larger initiative that expands this analysis across multiple customer segments.

Engage stakeholders early

Involve key stakeholders in the planning stages of larger initiatives. Their input can help shape the project and increase their investment in its success. This collaborative approach fosters a sense of shared ownership and commitment.

Attracting More Collaborators: Designing the Experience

As your organization begins to embrace a data-first culture, you’ll naturally attract more collaborators. It’s not just about a top-down mandate; it’s about creating an environment where people want to be involved. This is where human-centered design is applied to the process itself, making participation intrinsically rewarding.

Here’s how to encourage participation and make your data projects a magnet for talent:

Create cross-functional teams

Encourage collaboration across departments by forming cross-functional teams for data projects. This diversity of perspectives can lead to more innovative solutions and a stronger sense of community.

Offer training and resources

Provide training sessions and resources to help employees feel more comfortable with data tools and analytics. When team members feel equipped to contribute, they’re more likely to engage in data initiatives.

Celebrate collaboration

Recognize and reward collaborative efforts within your organization. Highlighting team achievements reinforces the value of working together and encourages others to join in.

Best Practices for Fostering a Collaborative Environment: HCD in Action

To truly make your data projects a success, you need to set up the right conditions for collaboration. The best results often come from casual, no-pressure environments where a diverse group of people can work together effectively.

Let participants inform their tiny project challenge (user agency)

A powerful way to spark collaboration is to let participants propose and shape their own data problem topics. This aligns with the HCD principle of cocreation, instantly building synergy and a shared sense of purpose. It often reveals that people from different departments, many of whom have never met, are facing the exact same challenge from different perspectives. They are often overjoyed to find a kindred spirit with whom to collaborate and innovate on a solution.

Optimize for interaction by balancing in-person and virtual collaboration

While the digital tools supporting remote work have expanded reach and accessibility, the choice of collaboration method for tiny projects is critical. In-person collaboration remains the most effective way to foster rapid, creative problem-solving. Being in the same room allows for spontaneous brainstorming, an immediate shared sense of energy, and the ability to read nonverbal cues, which accelerates the HCD empathy and ideation phases. Its advantages are speed, depth of connection, and cocreation quality. Virtual or remote collaboration has substantial advantages of its own: lower cost, greater geographic diversity, and increased participant accessibility, which can be invaluable for gathering a wider range of data perspectives. For truly tiny, complex, or urgent problem-solving, prioritize the high-bandwidth interaction of in-person settings, and use virtual tools for asynchronous check-ins, data sharing, and wider organizational inclusion.

Cultivate a “freedom to fail” mindset (psychological safety)

Explicitly state that this is a no-pressure environment where experimentation is encouraged. When people aren’t afraid of making mistakes, they are more willing to try new ideas, challenge assumptions, and learn from what doesn’t work. This psychological safety is crucial for rapid iteration and innovation, the hallmarks of effective HCD.

Ensure a diverse mix of people

A successful project isn’t just about data and technology. Bring together a highly diverse range of people from different departments, with varying levels of experience, and from a variety of disciplines. A project team that includes an HR person, a police officer, a data analyst, and a program manager can uncover insights that a homogeneous group never would.

Design for active collaboration (experiential design)

Move beyond traditional conference room setups. Create a comfortable environment that is suitable for active collaboration. This means having space to stand up, walk around, and use whiteboards or walls for posting ideas. Getting people out from behind their laptops encourages dynamic interaction and shared focus. Here, HCD principles apply to the design of the process experience itself.

Provide healthy food and drinks

Simple as it may seem, offering readily available, healthy, and tasty food and beverages makes a huge difference. It removes a minor distraction, signals that the organization values the team’s time, and fosters a more relaxed, communal atmosphere.

The Value Proposition for Collaborators: Designing for Intrinsic Motivation

The true secret to attracting collaborators isn’t just about providing resources—it’s about making the process personally and professionally rewarding. Tiny projects are an excellent way to do this because they’re inherently fun and self-edifying, and often lead to quick, visible success.

When projects are small and have a clear, rapid path to a solution, people are more willing to participate. They see it as a low-risk opportunity to experiment and have some fun. This is a chance to step away from their regular duties and engage in a different kind of problem-solving. This shift in mindset can be a refreshing and enjoyable experience.

Beyond the enjoyment, tiny projects offer a chance for personal and professional growth. Team members get to learn from their peers in different departments, gaining new skills and perspectives. It’s a form of on-the-job training that is far more engaging and relevant than a traditional workshop. They feel a sense of self-edification as they solve a real-world problem and gain confidence in their abilities.

Finally, these projects are often wildly, visibly, and rapidly successful. Because the scope is small, teams can quickly deliver tangible results. A project that saves a city millions of dollars or leads to direct policy changes in a police department in less than two days is a powerful story.

These successes are great for the organization, but they’re also a massive win for the individuals involved. They get to demonstrate their expertise and showcase the value they can add beyond their job description. This visibility and recognition are powerful motivators, encouraging people to participate in future projects because they want to have fun, be successful, and add value again.

You don’t have to do many tiny projects to see the effect. The personal benefits (the fun, the learning, the rapid success) become organizational cultural values that spread rapidly to other individuals and parts of the organization. This positive feedback loop is what transforms a data culture, one small, successful project at a time.

Scaling a Data-First Culture

Ultimately, the goal is to scale a data-first culture that extends beyond individual projects. By starting with tiny projects as HCD prototypes, showcasing wins as validated solutions, securing buy-in, and attracting collaborators through a well-designed process, organizations can create a sustainable environment where data-driven decision-making thrives.

As you embark on this journey, remember that building a resilient data culture is a marathon, not a sprint. Each tiny project is a step toward a larger vision, and with each success, you’ll be laying the groundwork for a future where data is at the heart of your organization’s strategy. Embrace the process, celebrate the wins, and watch as your data culture flourishes.

Technology Trends for 2025

Mike Loukides — Tue, 14 Jan 2025 11:12:50 +0000

Welcome to our annual report on the usage of the O’Reilly learning platform. It’s been an exciting year, dominated by a constant stream of breakthroughs and announcements in AI, and complicated by industry-wide layoffs. Generative AI gets better and better—but that trend may be at an end. Now the ball is in the application developers’ court: Where, when, and how will AI be integrated into the applications we build and use every day? And if AI replaces the developers, who will be left to do the integration? Our data shows how our users are reacting to changes in the industry: Which skills do they need to brush up on? Which do they need to add? What do they need to know to do their day-to-day work? In short: Where have we been in the past year, and where are we going?

We aren’t concerned about AI taking away software developers’ jobs. Ever since the computer industry got started in the 1950s, software developers have built tools to help them write software. AI is just another tool, another link added to the end of that chain. Software developers are excited by tools like GitHub Copilot, Cursor, and other coding assistants that make them more productive.

That’s only one of the stories we’re following. Here are a few of the others:

The next wave of AI development will be building agents: software that can plan and execute complex actions.
There seems to be less interest in learning about programming languages, Rust being a significant exception. Is that because our users are willing to let AI “learn” the details of languages and libraries for them? That might be a career mistake.
Security is finally being taken seriously. CEOs are tired of being in the news for the wrong reasons. AI tools are starting to take the load off of security specialists, helping them to get out of “firefighting” mode.
“The cloud” has reached saturation, at least as a skill our users are studying. We don’t see a surge in “repatriation,” though there is a constant ebb and flow of data and applications to and from cloud providers.
Professional development is very much of interest to our users. Specifically, they’re focused on being better communicators and leading engineering teams.

All of these trends have been impacted, if not driven, by AI—and that impact will continue in the coming year.

Finally, some notes about methodology. Skip this paragraph if you want; we don’t mind. This report is based on the use of O’Reilly’s online learning platform from January 1, 2024, to September 30, 2024. Year-over-year comparisons are based on the same period in 2023. The data in each graph is based on O’Reilly’s “units viewed” metric, which measures the actual use of each item on the platform. It accounts for different usage behavior for different media: text, courses, and quizzes. In each graph, the data is scaled so that the item with the greatest units viewed is 1. That means items within a graph are comparable to each other, but you can’t compare an item in one graph to an item in another. And all percentages are reported with two significant digits.
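For readers who want the arithmetic spelled out, here is a minimal sketch of the scaling and year-over-year calculations described above. The topic names and unit counts are invented for illustration and are not O’Reilly data; the function names are hypothetical, and the platform’s actual pipeline is not described in this report.

```python
from math import floor, log10


def scale_to_max(units):
    """Scale raw units viewed so the largest item in a graph is 1."""
    top = max(units.values())
    return {topic: value / top for topic, value in units.items()}


def round_sig(x, digits=2):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    return round(x, digits - 1 - floor(log10(abs(x))))


def yoy_change(current, previous):
    """Year-over-year change in percent, to two significant digits."""
    return round_sig(100 * (current - previous) / previous)


# Invented numbers, for illustration only.
units_2024 = {"Topic A": 5000, "Topic B": 2500, "Topic C": 500}
scaled = scale_to_max(units_2024)
```

Because each graph is divided by its own maximum, a scaled value of 0.5 in one graph and 0.5 in another say nothing about how the two items compare in raw usage.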

Skills

When we look at how our customers use the O’Reilly learning platform, we always think in terms of skills. What skills are they trying to gain? And how are they trying to improve their knowledge? This year, one thread that we see across all of our platform is the importance of artificial intelligence. It’s all about upskilling in the age of AI.

Artificial Intelligence

It will surprise absolutely nobody that AI was the most active category in the past year. For the past two years, large models have dominated the news. That trend started with ChatGPT and its descendants, most recently OpenAI’s o1. But unlike 2022, when ChatGPT was the only show anyone cared about, we now have many contenders. Claude has emerged as a favorite among programmers. After a shaky start, Google’s Gemini models have become solid performers. Llama has established itself as one of the top models and as the matriarch of a rich ecosystem of open¹ models. Many of the open models can deliver acceptable performance when running on laptops and phones; some are even targeted at embedded devices.

So what does our data show? First, interest in almost all of the top skills is up: From 2023 to 2024, Machine Learning grew 9.2%; Artificial Intelligence grew 190%; Natural Language Processing grew 39%; Generative AI grew 289%; AI Principles grew 386%; and Prompt Engineering grew 456%. Among the top topics, the most significant decline was for GPT itself, which dropped by 13%—not a huge decline but certainly a significant one. Searches for GPT peaked in March 2023 and have been trending downward ever since, so our search data matches our usage data.

We’re used to seeing interest move from a more general high-level topic to specific skills as an industry sector matures, so this trend away from GPT in favor of more abstract, high-level topics is counterintuitive. But in context, it’s fairly clear what happened. For all practical purposes, GPT was the only game in town back in 2023. The situation is different now: There’s lots of competition. These other models don’t yet show up significantly in search or usage data, but the users of our platform have figured out what’s important: not learning about GPT or Claude or Gemini or Mistral but getting the background you need to make sense of any model. Discovering a workflow that fits your needs is important, and as Simon Willison points out, your ideal workflow may actually involve using several models. Recent models are all good, but they aren’t all good in the same way.

AI has had a great year, but will it continue to show gains in 2025? Or will it drop back, much as ChatGPT and GPT did? That depends on many factors. Gartner has generative AI slipping into the “trough of disillusionment”—and whatever you think of the technology’s promise, remember that the disillusionment is a sociological phenomenon, not a technical one, and that it happens because new technologies are overhyped. Regardless of generative AI’s long-term promise, we expect some disillusionment to set in, especially among those who haven’t properly understood the technology or its capabilities.

Prompt Engineering, which gained 456% from 2023 to 2024, stands out. A 456% gain isn’t as surprising as it seems; after all, people only started talking about prompt engineering in 2023. Although “prompt engineering” was bandied about as a buzzword, it didn’t become a skill that employers were looking for until late in 2023, if that. That may be an early warning signal for AI disillusionment. Searches for “prompt engineering” grew sharply in 2023 but appeared to decline slightly in 2024. Is that noise or signal? If disillusionment in Prompt Engineering sets in, we’ll also see declines in higher-level topics like Machine Learning and Artificial Intelligence.

There’s a different take on the future of prompt engineering. There have been a number of arguments that the need for prompt engineering is temporary. As generative AI improves, this line of reasoning contends, we will no longer need to write complex prompts that specify exactly what we want the AI to do and how to do it. Prompts will be less sensitive to exactly how they’re worded; changing a word or two will no longer give a completely different result. We’ll no longer have to say “explain it to me as if I were five years old” or provide several examples of how to solve a problem step-by-step.

Some recent developments point in that direction. Several of the more advanced models have made the “explain it to me” prompts superfluous. OpenAI’s o1 has been trained in a way that maximizes its problem-solving abilities, not just its ability to string together coherent words. At its best, it eliminates the need to write prompts that demonstrate how to solve the problem (a technique called few-shot prompting). At worst, it “decides” on an inappropriate process, and it’s difficult to convince it to solve the problem a different way. Anthropic’s Claude has a new (beta) computer use feature that lets the model use browsers, shells, and other programs: It can click on links and buttons, select text, and do much more. (Google and OpenAI are reportedly working on similar features.) Enabling a model to use the computer in much the same way as a human appears to give it the ability to solve multistep problems on its own, with minimal description. It’s a big step toward a future full of intelligent agents: linked AI systems that cooperate to solve complex problems. However, Anthropic’s documentation is full of warnings about serious security vulnerabilities that remain to be solved. We’re thrilled that Anthropic has been forthright about these weaknesses. But still, while computer use may be a peek at the future, it’s not ready for prime time.
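For readers who haven’t written one, a few-shot prompt of the kind mentioned above is simply a prompt that shows worked examples before the real question. The sketch below only assembles the text; it calls no model, and the “Problem:/Solution:” layout and the function name are assumptions for illustration, not a format any vendor requires.

```python
def build_few_shot_prompt(examples, question):
    """Assemble a prompt that shows worked examples before the real question."""
    parts = []
    for problem, solution in examples:
        parts.append(f"Problem: {problem}\nSolution: {solution}")
    # Leave the last solution blank so the model continues from here.
    parts.append(f"Problem: {question}\nSolution:")
    return "\n\n".join(parts)


# Invented examples that demonstrate the step-by-step style we want back.
examples = [
    ("What is 12 + 30?", "12 + 30 = 42. The answer is 42."),
    ("What is 7 * 6?", "7 * 6 = 42. The answer is 42."),
]
prompt = build_few_shot_prompt(examples, "What is 9 * 5?")
```

The argument that prompt engineering is temporary amounts to saying that the `examples` list should eventually become unnecessary.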

AI will almost certainly slide into a trough of disillusionment; as I’ve said, the trough has more to do with sociology than with technology. But OpenAI and Anthropic are demonstrating important paths forward. Will these experiments bear fruit in the next year? We’ll see.

Figure: Artificial intelligence

Many skills associated with AI also showed solid gains. Use of content about Deep Learning is up 14%, Generative Models is up 26%, and GitHub Copilot is up 471%. Use of content about the major AI libraries was up slightly: PyTorch gained 6.9%, Keras increased 3.3%, and Scikit-Learn gained 1.7%. Usage of TensorFlow content declined 28%; its continued decline indicates that PyTorch has won the hearts and minds of AI developers.

These gains, particularly Copilot’s, are impressive, but a more important story concerns two skills that came out of nowhere: Usage of content about LangChain is on a par with PyTorch, and RAG is on a par with Keras. Neither of these skills was in last year’s report; in 2023, content usage for LangChain and RAG was minimal, largely because little content existed. They’ve caught on because both LangChain and RAG are tools for building better applications on top of AI models. GPT, Claude, Gemini, and Llama aren’t the end of the road. RAG lets you build applications that send private data to a model as part of the prompt, enabling the model to build answers from data that wasn’t in its training set. This process has several important consequences: It reduces the probability of error or “hallucination”; it makes it possible to attribute answers to the sources from which they came; and it often makes it possible to use a much smaller and more economical model.
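The RAG pattern described above can be sketched in a few lines. This is a toy under stated assumptions: the documents are invented, word overlap stands in for the embedding or vector search a production system would more likely use, and no model is called. The sketch stops at building the prompt that would be sent.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def retrieve(query, documents, k=2):
    """Return the k documents that share the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc["text"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, retrieved):
    """Place the retrieved passages, labeled by source, ahead of the question."""
    context = "\n".join(f"[{doc['source']}] {doc['text']}" for doc in retrieved)
    return (
        "Answer using only the context below and cite the source labels.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical private documents, invented for illustration.
documents = [
    {"source": "hr-handbook", "text": "Employees accrue 20 vacation days per year."},
    {"source": "it-policy", "text": "Laptops are replaced every three years."},
    {"source": "finance-faq", "text": "Expense reports are due by the fifth of each month."},
]

query = "How many vacation days do employees get per year?"
top = retrieve(query, documents, k=1)
prompt = build_prompt(query, top)
```

The source labels in the prompt are what make attribution possible: the model can be asked to cite `[hr-handbook]` rather than answer from whatever it absorbed in training.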

LangChain is the first of many frameworks for building AI agents. (OpenAI has Swarm; Google has an Agent Builder that’s part of Vertex; Salesforce and other vendors also have offerings.) Agents are software that can plan and execute multistage actions, many of which are delegated to other AI models. Claude’s computer use API is another facet of this trend, along with whatever products OpenAI and Google may be building. Saying that usage has increased 26 million percent isn’t to the point—but realizing that LangChain has grown from near zero to a platform on a par with PyTorch is very much so. Agentic applications are certainly the next big trend within AI.

Figure: Skills needed for AI

Data

Artificial intelligence relies heavily on what we used to call (and perhaps still call) data science. Building AI models requires data at unprecedented scale. Building applications with RAG requires a portfolio of data (company financials, customer data, data purchased from other sources) that can be used to build queries, and data scientists know how to work with data at scale.

Therefore, it’s not surprising that Data Engineering skills showed a solid 29% increase from 2023 to 2024. SQL, the common language of all database work, is up 3.2%; Power BI was up 3.0%, along with the more general (and much smaller) topic Business Intelligence (up 5.0%). PostgreSQL is close to edging ahead of MySQL, with a 3.6% gain. Interest in Data Lake architectures rose 59%, while the much older Data Warehouse held steady, with a 0.3% decline. (In our skill taxonomy, Data Lake includes Data Lakehouse, a data storage architecture that combines features of data lakes and data warehouses.) Finally, ETL grew 102%. With the exception of ETL, the gains are smaller than the increases we saw for AI skills, but that makes sense: AI is an exciting new area, and data is a mature, stable category. The number of people who need specialized skills like ETL is relatively small but obviously growing as data storage becomes even more important with AI.

It’s worth understanding the connection between data engineering, data lakes, and data lakehouses. Data engineers build the infrastructure to collect, store, and analyze data. The data needed for an AI application almost always takes many forms: free-form text, images, audio, structured data (for example, financial statements), etc. Data often arrives in streams, asynchronously and more or less constantly. This is a good match for a data lake, which stores data regardless of structure for use later. Because data receives only minimal processing when it arrives, it can be stored in near real time; it’s cleaned and formatted in application-specific ways when it’s needed. Once data has been stored in a data lake, it can be used for traditional business analytics, stored in a vector or graph database for RAG, or put to almost any other use. A data lakehouse combines both structured and unstructured data in a single platform.
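The “store now, structure later” idea behind a data lake can be illustrated with a small sketch. Everything here is an assumption for illustration: an in-memory list stands in for real object storage, and the record kinds and field names are invented.

```python
import json
from datetime import datetime, timezone

lake = []  # in-memory stand-in for object storage


def ingest(payload, kind):
    """Store the payload as it arrived, adding only minimal metadata."""
    lake.append({
        "kind": kind,
        "received_at": datetime.now(timezone.utc).isoformat(),
        "raw": payload,
    })


def read_invoices():
    """Apply structure at read time: parse and clean only what this use needs."""
    for record in lake:
        if record["kind"] != "invoice":
            continue
        data = json.loads(record["raw"])
        yield {
            "customer": data["customer"].strip().title(),
            "amount": float(data["amount"]),
        }


# Structured and free-form data land in the same place, untouched.
ingest('{"customer": "  acme corp ", "amount": "1200.50"}', "invoice")
ingest("Customer called to ask about a late delivery.", "support_note")
ingest('{"customer": "globex", "amount": "310"}', "invoice")

invoices = list(read_invoices())
```

Ingestion does no parsing, which is why it can keep up with data arriving in near real time; the cleaning cost is paid later, and only by the applications that need that particular view.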

Figure: Data analysis (including databases)

Software Development

What do software developers do all day? They write software. Programming is an important part of the job, but it’s not the whole thing; best estimates are that programmers spend roughly 20% of their time writing code. The rest of their time is spent understanding the problems they’re being asked to solve, designing appropriate solutions, documenting their work, updating management on the status of their projects, and much more.

Software architecture, which focuses on understanding a customer’s requirements and designing systems to meet those requirements, is an important part of the overall software development picture. It’s a skill to which many of our software developers and programmers aspire.

Architecture

This year’s data shows that software architecture continues to be one of the most desirable skills in the industries we serve. Usage of material about Software Architecture rose 5.5% from 2023 to 2024, a small but significant increase. But it’s more important to ask why it increased. A position in software architecture may be perceived as more secure in a time of layoffs, and it’s often perceived as another step forward in a career that moves from junior programmer to senior to lead. In addition, the rise of AI presents many architectural challenges: Do we try to build our own model? (The answer is usually “no.”) Should we use an AI service provider like OpenAI, Anthropic, Microsoft, or Google, or should we fine-tune and host our own model on our own infrastructure? How do we build applications that are safe (and how do we define “safe”)? How do we evaluate performance? These questions all have a bearing on software architecture. Furthermore, AI might provide tools to help software architects, but so far, these tools can do little for the substance of the job: understanding customers’ needs and helping them define what they want to build. With AI in the picture, we’re all building new kinds of applications—and those applications require architects to help design them.

In this context, it’s no surprise that Enterprise Architecture is up 17% and Distributed Systems is up 35%. Enterprise architecture is a staple: As Willie Sutton said about banks, “That’s where the money is.” It’s a good bet that many enterprises are trying to integrate AI into their systems or update legacy systems that are no longer scalable or maintainable. We can (and do) make the same argument about distributed systems. Modern enterprises work on a scale that was unimaginable a few decades ago. Scale isn’t just for companies like Amazon and Google. To survive, even small businesses need to develop an online presence—and that means building systems in the cloud that can handle surges in demand gracefully. It means building systems that can withstand outages. Distributed systems aren’t just massive deployments with hundreds of thousands of nodes. Your business may only require a dozen nodes, but regardless of the scale, it still faces the architectural challenges that come with distributed systems.

Some of the more significant ideas from the past decade seem to be falling out of favor. Microservices declined 24%, though content use is still substantial. Domain-Driven Design, which is an excellent skill for designing with microservices, is down 22%. Serverless is down 5%; this particular architectural style was widely hyped and seemed like a good match for microservices but never really caught on, at least based on our platform’s data.

What’s happening? Microservice architectures are difficult to design and implement, and they aren’t always appropriate—from the start, the best advice has been to begin by building a monolith, then break the monolith into microservices when it becomes unwieldy. By the time you reach that stage, you’ll have a better feel for what microservices need to be broken out from the monolith. That’s good advice, but the hype got ahead of it. Many organizations that would never need the complexity of microservices were trying to implement them with underskilled staff. As an architectural style, microservices won’t disappear, but they’re no longer getting the attention they once were. And new ideas, like modular monoliths, may catch on in the coming years; modularity is a virtue regardless of scale or complexity.
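One way to picture a modular monolith is a single process in which modules talk to each other only through explicit interfaces. The sketch below is an illustration under assumptions, not a reference design: the `Orders` and `Billing` names are hypothetical, and real systems would add persistence, error handling, and much more.

```python
from typing import Protocol


class Billing(Protocol):
    """The only surface other modules may use to reach billing."""

    def charge(self, customer_id: str, amount_cents: int) -> str: ...


class InProcessBilling:
    """Billing module running inside the monolith."""

    def __init__(self) -> None:
        self._receipts = []

    def charge(self, customer_id: str, amount_cents: int) -> str:
        self._receipts.append((customer_id, amount_cents))
        return f"receipt-{len(self._receipts)}"


class Orders:
    """Orders module; depends on the Billing interface, not its internals."""

    def __init__(self, billing: Billing) -> None:
        self._billing = billing

    def place(self, customer_id: str, amount_cents: int) -> dict:
        receipt = self._billing.charge(customer_id, amount_cents)
        return {"customer": customer_id, "receipt": receipt}


orders = Orders(InProcessBilling())
order = orders.place("c-17", 2500)
```

If billing ever does need to become its own service, only the class behind the `Billing` interface has to change; `Orders` is untouched. That is the sense in which modularity pays off whether or not microservices ever arrive.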

Figure: Software architecture and design

Programming languages

Last year’s report showed that our users were consuming less content about programming languages. This year’s data continues that trend. We see a small drop for Python (5.3%) and a more significant drop for Java (13%). And even C++, which showed healthy growth from 2022 to 2023, is down 9% in 2024.

On the other hand, C is up (1.3%), and so is C# (2.1%). Rust is up 9.6%. The small increases in C and C# may just be noise. C is well-entrenched and isn’t going anywhere fast. Neither is C++, despite its drop. Rust’s increase continues a growth trend that stretches back several years; that’s an important signal. Rust is clearly winning over developers, at least for new projects. Now that the US government is placing a priority on memory safety, Rust’s emphasis on memory safety serves it well. Rust isn’t the first programming language to claim memory safety, nor will it be the last. (There are projects to add memory safety to C++, for example.) But right now, it’s the best positioned.

Aside from Rust, though, we need to ask what’s happening with programming skills. A few forces are applying downward pressure. Industry-wide layoffs may be playing a role. We’ve downplayed the effect of layoffs in the past, but we may have to admit that we were wrong: This year, they may be taking a bite out of skills development.

Could generative AI have had an effect on the development of programming language skills? It’s possible; shortly after GPT-3 was released, Simon Willison reported that he was learning Rust with the help of ChatGPT and Copilot, and more recently that he’s used Claude to write Rust code that he has in production, even though he doesn’t consider himself a skilled Rust developer.

It would be foolish to deny that generative AI will help programmers to become more productive. And it would be foolish to deny that AI will change how and what we learn. But we have to think carefully about what “learning” means, and why we learn in the first place. Programmers won’t have to remember all the little details of programming languages—but that’s never been the important part of programming, nor has rote memorization been an important part of learning. Students will never have to remember a half dozen sorting algorithms, but computer science classes don’t teach sorting algorithms because committing algorithms to memory is important. Every programming language has a sort() function somewhere in its libraries. No, sorting is taught because it’s a problem that everyone can understand and that can be solved in several different ways—and each solution has different properties (performance, memory use, etc.). The point is learning how to solve problems and understanding the properties of those solutions. As Claire Vo said in her episode of Generative AI in the Real World, we’ll always need engineers who think like engineers—and that’s what learning how to solve problems means. Whether lines end in a semicolon or a colon or whether you use curly braces, end statements, or tabs to delimit blocks of code is immaterial.
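To make the point about sorting concrete, here is a small sketch of two solutions to the same problem with different properties. These are textbook versions written for readability, not tuned implementations; the comparison counter is added purely for illustration.

```python
def insertion_sort(items):
    """Sort by growing a sorted prefix; returns (sorted list, comparisons)."""
    result = list(items)
    comparisons = 0
    for i in range(1, len(result)):
        j = i
        while j > 0:
            comparisons += 1
            if result[j - 1] <= result[j]:
                break  # element has reached its place in the sorted prefix
            result[j - 1], result[j] = result[j], result[j - 1]
            j -= 1
    return result, comparisons


def quicksort(items):
    """Sort by splitting around a pivot and sorting each part recursively."""
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)


data = [5, 2, 9, 1, 5, 6]
sorted_by_insertion, comparisons = insertion_sort(data)
sorted_by_quicksort = quicksort(data)
```

Insertion sort needs only one comparison per element when the input is already sorted but a quadratic number when it is reversed. This quicksort divides the problem into smaller pieces and is usually much faster on large inputs, but it allocates new lists at every level of recursion. Understanding trade-offs like these, not memorizing the code, is what the exercise teaches.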

Programming languages

The perception that generative AI minimizes the need to learn programming languages may limit the use of language-oriented content on our platform. Does that benefit the learners? If someone is using AI to avoid learning the hard concepts—like solving a problem by dividing it into smaller pieces (like quicksort)—they are shortchanging themselves. Shortcuts rarely pay off in the long term; coding assistants may help you to write some useful code, but those who use them merely as shortcuts rather than as learning tools are missing the point. Unfortunately, the history of teaching—going back centuries if not millennia—has stressed memorization. It’s time for both learners and teachers to grow beyond that.

Learning is changing as a result of AI. The way we teach, and the way our users want to be taught, is changing. Building the right kind of experiences to facilitate learning in an AI-enabled environment is an ongoing project for our learning platform. In the future, will our users learn to program by completing AI-generated tutorials that are customized in real time to their needs and abilities? That’s where we’re headed.

Web programming

Use of content about web programming skills is down, with few exceptions. A number of factors might be contributing to this. First, I can’t think of any significant new web frameworks in the past year; the field is still dominated by React (down 18%) and Angular (down 10%). There is some life near the bottom of the chart. The Svelte framework had significant growth (24%); so did Next.js (8.7%). But while these frameworks have their adherents, they’re far from dominant.

PHP (down 19%) still claims to have built the lion’s share of the web, but it’s not what developers reach for when they want to build something new, particularly if that “new” is a complex web application. The PHP world has been rocked by a bitter fight between the CEOs of Automattic (the developers of WordPress, by far the most important PHP framework) and WP Engine (a WordPress hosting platform). That fight started too late to affect this year’s results significantly, but it might weigh heavily next year.

A more significant development has been the movement away from complex platforms and back toward the simplicity of the earlier web. Alex Russell’s “Reckoning” posts summarize many of the problems. Our networks and our computers are much, much faster than they were 20 or 25 years ago, but web performance hasn’t improved noticeably. If anything, it’s gotten worse. We still wait for applications to load. Applications are hard to develop and have gotten harder over the years. There are several new frameworks that may (or may not) be lighter-weight, such as HTMX, Ludic, Glitch, and Cobalt. None of them have yet made a dent in our data, in part because none have built enough of a following for publishers and trainers to develop content—and you can’t have any units viewed if there isn’t anything to view. However, if you want an experience that isn’t dominated by heavyweight frameworks, doesn’t require you to become a JavaScript expert, and puts the fun back into building the web, this is where to look.

Web development

Web dev is a discipline that has been ill-served by shortcuts to learning. We hear too often about boot camp graduates who know a few React tricks but don’t understand the difference between React and JavaScript (or even know that JavaScript exists, let alone other programming languages). These programmers are very likely to lose their jobs to AI, which can already reproduce all the basic React techniques they’ve learned. Learning providers need to think about how AI is changing the workplace and how their students can partner with AI to build something beyond what AI can build on its own. Part of the solution is certainly a return to basics, ensuring that junior developers understand the tools with which they’re working.

IT Operations

Operations is another area where the trends are mostly downward. It may be small consolation, but the drops for several of the most important topics are relatively small: Linux is down 1.6%, Terraform is down 4.0%, and Infrastructure as Code is down 7.3%. As a skill, Terraform seems little hurt by the fork of Terraform that created the open source OpenTofu project, perhaps because the OpenTofu developers have been careful to maintain compatibility with Terraform. How this split plays out in the future is an open question. It’s worth noting the precipitous drop in Terraform certification (down 43%); that may be a more important signal than Terraform itself.

Kubernetes is down 20%. Despite that drop, which is sharper than last year’s 6.9% decrease, content teaching Kubernetes skills remains the second most widely used group in this category, and Kubernetes certification is up 6.3%. Last year, we said that Kubernetes needed to be simpler. It isn’t. There are no viable alternatives to Kubernetes yet, but there are different ways to deploy it. Kubernetes as a service managed by a cloud provider is certainly catching on, putting the burden of understanding every detail of Kubernetes’s operation on the shoulders of the provider. We also pointed to the rise of developer platforms; this year, the buzzword is “platform engineering” (Camille Fournier and Ian Nowland’s book is excellent), but as far as Kubernetes is concerned, it’s the same thing. Platform engineers can abstract knowledge of Kubernetes into a platform, minimizing software developers’ cognitive overhead. The result is that the number of people who need to know about Kubernetes is smaller.
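
As a rough illustration of what "abstracting Kubernetes into a platform" can mean, the sketch below has developers supply only an application name and a container image, while a helper fills in the rest of a Deployment manifest. The function name `deployment_manifest`, its defaults, and the registry URL are hypothetical; a real platform would also handle services, secrets, resource limits, and rollout policy.

```python
import json


def deployment_manifest(name, image, replicas=2, port=8080):
    """Build a Kubernetes Deployment (apps/v1) as a plain dictionary.

    Developers pass a name and an image; the platform team owns
    everything else about the manifest's shape and defaults.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


# Hypothetical application and registry, for illustration only.
manifest = deployment_manifest("orders", "registry.example.com/orders:1.4.2")
print(json.dumps(manifest, indent=2))
```

The developer-facing surface is two arguments. Knowledge about selectors, pod templates, and API versions stays with the small group that maintains the helper.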

Both DevOps (down 23%) and SRE (down 15%) dropped. There’s certainly some frustration with DevOps: Has it paid off? We ask a different question: Has it ever been tried? One problem with DevOps (which it shares with Agile) is that many companies “adopted” it in name but not in essence. They renamed a few positions, hired a few DevOps engineers, maybe created a DevOps group, never realizing that DevOps wasn’t about new job titles or new specialties; it was about reducing the friction between software development teams and operations teams. When you look at it this way, creating new groups and hiring new specialists can only be counterproductive. And the result is predictable: You don’t have to look far to find blogs and whitepapers claiming that DevOps doesn’t work. There’s also frustration with ideas like “shift left” and DevSecOps, which envision taking security into account from the start of the development process. Security is a different discussion, but it’s unclear how you build secure systems without taking it into account from the start. We’ve spent several decades building software and trying to fold security in at the last minute—we know how well that works.

Infrastructure and operations

In any case, the industry has moved on. Platform engineering is, in many ways, a natural outgrowth of both DevOps and SRE. As I’ve argued, the course of operations has been to increase the ratio of computers to operators. Is platform engineering the next step, allowing software developers to build systems that can handle their own deployment and routine operations without the help of operations staff?

IT certifications

General IT certifications, apart from security, trended downward. Use of content to prepare for the CompTIA A+ exam, an entry-level IT certification, was down 15%; CompTIA Network+ was down 7.9%. CompTIA’s Linux+ exam held its own, with a decline of 0.3%. On our platform, we’ve seen that Linux resources are in high demand. The slight decline for Linux-related content (1.6%) fits with the very small decrease in Linux+ certification.

For many years, Cisco’s certifications have been the gold standard for IT. Cisco Certified Network Associate (CCNA), a fairly general entry-level IT certification, showed the greatest usage and the smallest decline (2.2%). Usage of content to prepare for the Cisco Certified Network Professional (CCNP) exams, a cluster of related certifications on topics like enterprise networking, data centers, and security, dropped 17%. The Cisco Certified Internetwork Expert (CCIE) exams showed the greatest decline (36%). CCIE has long been recognized as the most comprehensive and in-depth IT certification. We’re not surprised that the total usage of this content is relatively small. CCIE represents the climax of a career, not the start. The number of people who attain it is relatively small, and those who do often include their CCIE number with their credentials. But the drop is surprising. It’s certainly true that IT is less focused on heavy-duty routing and switching for on-prem data centers (or even smaller machine rooms) than it was a few years ago. That work has largely been offloaded to cloud providers. While routers and switches haven’t disappeared, IT doesn’t need to support as wide a range of resources: They need to support office WiFi, some databases that need to remain on-premises, and maybe a few servers for office-related tasks. They’re very concerned about security, and as we’ll see shortly, security certifications are thriving. Is it possible that Cisco and its certifications aren’t as relevant as they used to be?

As we mentioned above, we also saw a drop in the relatively new certification for HashiCorp’s Terraform (43%). That’s a sharp decline, particularly since use of content about Terraform itself only declined 4.0%, showing that Terraform skills remain highly desirable regardless of the certification. A sudden drop in certification prep can be caused by a new exam, making older content out-of-date, but that isn’t the case here. Terraform certification certainly wasn’t helped by HashiCorp’s switch to a Business Source License or the subsequent fork of the Terraform project. IBM’s pending acquisition of HashiCorp (set to close before the end of 2024) may have introduced more uncertainty. Is the decline in interest for Terraform certification an indicator of dissatisfaction in the Terraform community?

Certifications for IT

The Kubernetes and Cloud Native Associate (KCNA, up 6.3%) was a bright spot in IT certification. Whether or not Kubernetes is overly complex (perhaps because it’s overly complex) and whether or not companies are moving out of the cloud, KCNA certification is a worthwhile asset. Cloud native applications aren’t going away. And whether they’re managing Kubernetes complexity by building developer platforms, using a Kubernetes provider, or using some other solution, companies will need people on their staff who can demonstrate that they have Kubernetes skills.

Cloud and cloud certifications

Content use for the major cloud providers and their certifications was down across all categories, with one exception: Use of content to prepare for Google Cloud certifications was up 2.2%.

What does that tell us, if anything? Are we looking at a “cloud repatriation” movement in full swing? Are our customers moving their operations back from the cloud to on-prem (or hosted) data centers? Last year, we said that we see very little evidence that repatriation is happening. This year? An article in The New Stack argues that cloud repatriation is gathering steam. While that might account for the decline in the use of cloud-related content, we still see little evidence that repatriation is actually happening. Two case studies (37signals and GEICO) don’t make a trend. The ongoing expense of operating software in the cloud probably is greater than the cost of running it on-premises. But the cloud allows for scaling on demand, and that’s important. It’s true, few businesses have the sudden usage peaks that are driven by events like retail’s Black Friday. But the cloud providers aren’t just about sudden 10x or 100x bursts of traffic; they also allow you to scale smoothly from 1x to 1.5x to 2x to 3x, and so on. It saves you from arguing that you need additional infrastructure until the need becomes a crisis, at which point, you don’t need to grow 1.5x; you need 5x. After moving operations to the cloud and experiencing a few years of growth—even if that growth is moderate—moving back to an on-premises data center will require significant capital expense. It will probably require gutting all the infrastructure that you haven’t been using for the past year and replacing it with something up-to-date.

Does this mean that cloud providers are “roach motels,” where you can move in but you can’t move out? That’s not entirely untrue. But the ease of scaling by allocating a few more servers and seeing a slightly higher bill the next month can’t be ignored, even if those slightly higher bills sound like the proverbial story of boiling the frog. Evaluating vendors, waiting for delivery, installing hardware, configuring hardware, testing hardware—that’s effort and expense that businesses are offloading to cloud vendors. The ability to scale fluidly is particularly important in the age of AI. Few companies have the skills needed to build on-premises infrastructure for AI, with its cooling and power requirements. That means either buying AI services directly from cloud providers or building infrastructure to host your own models. And of course, the cloud providers have plenty of help for companies that need to use their high-end GPUs. (Seriously—if you want to host your AI application on-premises, see how long it will take to get delivery of NVIDIA’s latest GPU.) The reality, as IDC concluded in a survey of cloud use, is that “workload repatriation from public cloud into dedicated environments goes hand in hand with workload migration to public cloud activities, reflecting organizations’ continuous reassessment of IT environments best suited for serving their workloads.” That is, there’s a constant ebb and flow of workloads to and from public clouds as companies adapt their strategies to the business environment.

Cloud providers and certifications

The buzzword power of “the cloud” lasted longer than anyone could reasonably have expected, but it’s dead now. However, that’s just the buzzword. Companies may no longer be “moving to the cloud”; that move has already happened, and their staff no longer need to learn how to do it. Organizations now need to learn how to manage the investments they’ve made. They need to learn which workloads are most appropriate for the cloud and which are better run on-premises. IT still needs staff with cloud skills.

Security

Security Governance drove the most content use in 2024, growing 7.3% in the process and overtaking Network Security (down 12%). The rise of governance is an important sign: “Security” is no longer an ad hoc issue, fixing vulnerabilities in individual applications or specific services. That approach leads to endless firefighting and eventually failure—and those failures end up in the major news media and result in executives losing their jobs. Security is a company-wide issue that needs to be addressed in every part of the organization. Confirming the growing importance of security governance, interest in Governance, Risk, and Compliance (GRC) grew 44%, and Compliance grew 10%. Both are key parts of security governance. Security architecture also showed a small but significant increase (3.7%); designing a security architecture that works for an entire organization is an important part of looking at the overall security picture.

The use of content about Application Security also grew significantly (17%). That’s a very general topic, and it perhaps doesn’t say much except that our users are interested in securing their applications—which goes without saying. But what kinds of applications? All of them: web applications, cloud applications, business intelligence applications, everything. We get a bigger signal from the increase in Zero Trust (13%), a particularly important strategy for securing services in which every user, human or otherwise, must authenticate itself to every service that it uses. In addition, users must have appropriate privileges to do what they need to do, and no more. It’s particularly important that zero trust extends authentication to nonhuman users (other computers and other services, whether internal or external). It’s a response to the “hard, crunchy outside, but soft chewy inside” security that dominated the 1990s and early 2000s. Zero trust assumes that attackers can get through firewalls, that they can guess passwords, and that they can compromise phones and computers when they’re outside the firewall. Firewalls, good passwords, and multifactor authentication systems are all important—they’re the hard, crunchy outside that prevents an attacker from getting in. Zero trust helps keep attackers outside, of course—but more than that, it limits the damage they can do once they’re inside.
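
The two zero trust ideas described above (every caller authenticates on every request, and every caller gets only the privileges it needs) can be sketched in a few lines. This is a toy model under stated assumptions: the caller names, shared secrets, and permission table are hypothetical, and a production system would more likely rely on mutual TLS or short-lived signed tokens than on static shared secrets.

```python
import hashlib
import hmac

# Hypothetical registry of callers. Note that one caller is a human
# and one is another service; both are treated the same way.
SECRETS = {
    "alice": b"alice-secret",
    "billing-service": b"billing-secret",
}

# Least privilege: each caller is granted only the (service, action)
# pairs it actually needs.
PERMISSIONS = {
    "alice": {("orders", "read")},
    "billing-service": {("orders", "read"), ("invoices", "write")},
}


def sign(caller, service, action, secret):
    """Sign one specific request with the caller's secret."""
    message = f"{caller}|{service}|{action}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def authorize(caller, service, action, signature):
    """Check identity first, then privileges, on every single request."""
    secret = SECRETS.get(caller)
    if secret is None:
        return False  # unknown caller, wherever the request came from
    expected = sign(caller, service, action, secret)
    if not hmac.compare_digest(expected, signature):
        return False  # authentication failed
    return (service, action) in PERMISSIONS.get(caller, set())


good = sign("alice", "orders", "read", SECRETS["alice"])
print(authorize("alice", "orders", "read", good))  # True

overreach = sign("alice", "invoices", "write", SECRETS["alice"])
print(authorize("alice", "invoices", "write", overreach))  # False
```

The second call shows the "limits the damage" property: a caller that authenticates successfully is still refused anything outside its own permission set, so a compromised account cannot roam freely once it is inside.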

Security skills

We’re puzzled by the drop in use of content about Network Security, which corresponds roughly to the drop in Cisco certifications. Network Security is still the second most widely used skill, but it’s down 12% from 2023 to 2024. Perhaps network security isn’t deemed as important when employees wander in and out of company networks and applications are distributed between in-house servers and the cloud. We hope that our users aren’t making that mistake. A bigger issue is that networks haven’t changed much in the past few years: We’re still using IPv4; we’re still using routers, switches, and firewalls, none of which have changed significantly in recent years. What has changed is the way security is implemented. Cloud computing and zero trust have moved the focus from big-iron networking devices to interactions between systems, regardless of how they are connected.

Security certifications

Security certification has been one of the biggest growth areas on our platform. As I’ve said elsewhere, security professionals love their certifications. There’s a good reason for that. In most other specialties, it’s possible to build a portfolio of programs you wrote, systems you architected, sites you’ve designed. What can a security person say in a job interview? “I stopped 10,000 people from logging in last year?” If you’ve ever monitored a public-facing Linux system, you know that claim means little. Security is cursed with the problem that the best news is no news: “Nothing bad happened” doesn’t play well with management or future employers. Neither does “I kept all the software patched, and spent time reading CVEs to learn about new vulnerabilities”—even though that’s an excellent demonstration of competence. Certification is a way of proving that you have certain skills and that you’ve met some widely recognized standards.

The CISSP (up 11%) and CompTIA Security+ (up 13%) certifications are always at the top of our lists, and this year is no exception. Our State of Security in 2024 report showed that CISSP was the certification most commonly required by employers. If there’s a gold standard for security skills, CISSP is it: It’s a thorough, comprehensive exam for people with more than five years of experience. CompTIA Security+ certification has always trailed CISSP slightly in our surveys and in platform performance, but its position in second place is uncontested. Security+ is an entry-level certification; it’s particularly desirable for people who are starting their security careers.

Security certification was especially important for government users. For most industry sectors, usage focused on programming skills in Java or Python, followed by artificial intelligence. The government sector was a strong outlier. Security and IT certifications were by far the most important topics. CompTIA Security+ and CISSP (in that order) led.

Moving beyond CISSP and Security+, many of the other security certifications also showed gains. Certified Ethical Hacker (CEH) was up 1.4%, and the less popular CompTIA PenTest+ certification was up 3.3%. Certified Cloud Security Professional was up 2.4%, somewhat less than we’d expect, given the importance of the cloud to modern IT, but it’s still a gain. ISACA’s Certified in Risk and Information Systems Control (CRISC) was up 45%, Certified Information Security Manager (CISM) grew 9.3%, and Certified Information Systems Auditor (CISA) was up 8.8%; these three certifications are strongly associated with security governance. The most significant declines were for the CompTIA Cybersecurity Analyst (CySA+) certification (down 13%) and CCNA Security (down 55%). The drop in CCNA Security is extreme, but it isn’t unexpected given that none of the Cisco certifications showed an increase this year.

We’re missing one important piece of the security certification puzzle. There’s no data on AI security certifications—and that’s because there aren’t any. Software that incorporates AI must be built and operated securely. That will require security experts with AI expertise (and who can demonstrate that expertise via certifications). We expect (or maybe a better phrase is “we hope”) that lack will be addressed in the coming year.

Professional Development

Professional development continues to be an important growth area for our audience. The most important skill, Professional Communication, grew 4.5%—not much but significant. We saw a 9.6% increase in users wanting to know more about Engineering Leadership, and a 21.5% increase in users using content about Personal Productivity.

Project Management was almost unchanged from 2023 to 2024 (up 0.01%), while the use of content about the Project Management Professional (PMP) certification grew 15%. Interest in Product Management declined 11%; it seems to be a skill that our users are less interested in. Why? For the past few years, product manager has seemed to be a trendy new job title. But in last year’s report, Product Management only showed a small gain from 2022 to 2023. Is interest in Product Management as a skill or as a job title fading?

Professional development and skills

We also saw a 7.9% decline in Leadership (aside from Engineering Leadership), and a huge 35% decline for IT Management. Are we to blame these on the corporate layoff cycle? That’s possible, but it’s too easy. IT may be affected by a general trend toward simplification and platform engineering, as we’ve discussed: A platform engineering group can do a lot to reduce cognitive overhead for developers, but it also reduces the need for IT staff. A platform engineering group doesn’t have to be large; is the need for IT staff shrinking? The decline in Leadership may be because it’s a vague, nonspecific term, unlike Engineering Leadership (which is up). Engineering Leadership is concrete and it’s something our engineering-oriented audience understands.

New Initiatives

In 2024, we introduced several new features on the O’Reilly learning platform, including badges, quizzes, and a new version of O’Reilly Answers. What are they telling us?

Badges and Quizzes

We started a badging program late in 2023: Users from business accounts can earn badges for taking courses and completing quizzes. We won’t go into the program details here, but since the program started, users have earned nearly 160,000 badges. We’re still building the program, but we’re encouraged by its first year.

Badges can give us more insight into what our users are learning. The most popular badges are for Python skills, followed by GPT and prompt engineering. Generative AI and machine learning are also high on the list. Kubernetes, despite its decline in units viewed, was the fourth-most-frequently-acquired badge, with almost the same number of badges earned as software architecture. Linux, SQL, professional communication, and Java rounded out the top 11. (Yes, 11—we wanted to include Java). The difference between Java and Python is striking, given that the use of content about these skills is similar. (Python leads Java, but not by much.) Oracle has a highly regarded Java certification program, and there’s really no equivalent for Python. Perhaps our users recognize that obtaining a Java badge is superfluous, while obtaining badges for Pythonic skills is meaningful?

Quizzes are closely tied to badges: If a final quiz exists for a course or for a book, students must pass it to earn their badge. Quiz usage appears to follow the same trends as badging, though it’s premature to draw any conclusions. While a few legacy quizzes have been on the platform for a long time (and aren’t connected to badging), the push to develop quizzes as part of the badging program only began in June 2024, and quiz usage is still as much a consequence of the time the quiz has been available on the platform as it is of the skill for which it’s testing.

Top badges earned (relative to Python)

We can also look at the expertise required by the badges that were earned. All of our content is tagged with a skill level: beginner, beginner-intermediate, intermediate, intermediate-advanced, or advanced. Of the badges earned, 42% were for content judged to be intermediate and 33% were for beginner content, while only 4.4% were for advanced content. It’s somewhat surprising that the largest share of badges was earned for intermediate-level content, though perhaps that makes sense given the badge program’s B2B context: For the most part, our users are professionals rather than beginners.

Badges earned by expertise level (percent)

Answers

One of our most important new features in 2024 was an upgrade to O’Reilly Answers. Answers is a generative AI-powered tool that allows users to enter natural language questions and generates responses from content in our platform. Unlike most other generative AI products, Answers always provides links to the original sources its responses are based on. These citations are tracked and used to calculate author royalties and payments to publishing partners.

So the obvious question is: What are our users asking? One might guess that the questions in Answers would be similar to the search terms used on the platform. (At this point, Answers and search are distinct from each other.) That guess is partly right—and partly wrong. There are some obvious differences. Common search terms include book titles, author names, and even ISBNs; titles and author names rarely appear in Answers. The most common searches are for single words, such as “Python” or “Java.” (The average length of the top 5,000 searches in September 2024 was two words, for instance.) There are few single word questions in Answers (though there are some); most questions are well-formed sentences like “How many ways can you create a string object in Java?” (The average question length was nine words.)

To analyze the questions from O’Reilly Answers, we essentially turned them back into single-word questions. First, we eliminated questions from a “question bank” that we created to prime the pump, as it were: Rather than requiring users to write a new question, we offered a list of prewritten queries they could click on. While there’s undoubtedly some useful signal in how the question bank was used, we were more interested in what users asked of their own volition. From the user-written questions, we created a big “bag of words,” sorted them by frequency, and eliminated stopwords. We included a lot of stopwords that aren’t in most lists: words like “data” (what does that mean by itself?) and “chapter” (yes, you can ask about a chapter in a book, but that doesn’t tell us much).
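
The steps just described can be sketched roughly as follows. This is not the code we ran; the function name, the tokenizing pattern, the short stopword list, and the sample questions (apart from the two questions quoted in this report) are invented for illustration.

```python
import re
from collections import Counter

# A small stand-in for a standard stopword list.
STANDARD_STOPWORDS = {
    "a", "an", "the", "how", "do", "you", "in", "of",
    "what", "is", "can", "many", "to", "for",
}
# Extra stopwords that carry little signal on their own.
EXTRA_STOPWORDS = {"data", "chapter"}


def top_words(questions, question_bank, n=10):
    """Count words in user-written questions and return the n most common."""
    # Step 1: drop prewritten questions that users merely clicked on.
    user_written = [q for q in questions if q not in question_bank]

    # Step 2: build the bag of words (lowercased; keeps tokens like "c++").
    counts = Counter()
    for question in user_written:
        counts.update(re.findall(r"[a-z0-9+#]+", question.lower()))

    # Step 3: eliminate stopwords, then rank what is left by frequency.
    for word in STANDARD_STOPWORDS | EXTRA_STOPWORDS:
        counts.pop(word, None)
    return counts.most_common(n)


question_bank = {"What is Kubernetes?"}
questions = [
    "How many ways can you create a string object in Java?",
    "How do you flatten a list of lists in Python?",
    "What is a Python decorator?",
    "What is the data model in chapter 3?",
    "What is Kubernetes?",  # from the question bank, so it is ignored
]

ranked = top_words(questions, question_bank)
print(ranked[0])  # ('python', 2)
```

Removing stopwords before or after sorting gives the same ranking; the sketch removes them first only because it is simpler to write.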

With that background in mind, what were the most common words in Answers and in searches? In order:

*Answers*	*Search Queries*
Python	Python
Java	Machine learning
Management	Kubernetes
Key	Java
Model	Rust
Security	React
File	AWS
Architecture	CISSP
AI	C++
System	Linux
Service	Docker
Project	SQL
Learning	JavaScript

There’s an obvious difference between these two lists. The Answers list consists mostly of words that could be part of longer questions. The Search list is made up of topics and skills about which one might want information. That’s hardly surprising or insightful. We’ve said most searches on the platform are single-word searches, which means that those words have to be stand-alone skills or topics, like Python or Java. Likewise, Answers was built to allow users to ask more detailed, in-depth questions and get focused answers from the content on our platform—so rather than seeing single word searches, we’re seeing common words from longer questions. Maybe that’s a self-fulfilling prophecy, but it’s also showing that Answers is working the way we intended.

There’s a little more signal here. Python and Java are the two top programming languages on both lists, but if we look at search queries, machine learning and Kubernetes are sandwiched between the two languages. That may just be a result of our users’ experiences with services like ChatGPT. Programmers quickly learned that they can get reasonable answers to questions about Java and Python, and the prompts don’t have to be very complex. My personal favorite is “How do you flatten a list of lists in Python?,” which can be answered by most chatbots correctly but isn’t meaningful to our search engine.
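
For readers who are curious, that question has a short, well-known answer, which is part of why a brief prompt is enough to get it. Two common idioms, shown with an arbitrary example list:

```python
from itertools import chain

nested = [[1, 2], [3], [], [4, 5]]

# Idiom 1: chain the inner lists together.
flat = list(chain.from_iterable(nested))

# Idiom 2: a nested list comprehension.
flat_comprehension = [item for sublist in nested for item in sublist]

print(flat)  # [1, 2, 3, 4, 5]
```

Both flatten exactly one level of nesting, which is what the question usually means.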

Kubernetes raises a different question: Why is it the third-most-common search engine query but doesn’t appear among the top words on Answers? (It’s the 90th-most-common word on Answers, though the actual rank isn’t meaningful.) While Kubernetes is a topic that’s amenable to precise questions, it’s a complex tool, and coming up with precise prompts is difficult; writing a good question probably requires a good understanding of your IT infrastructure. You might need to understand how to solve your problem before you can ask a good question about how to solve your problem. A search engine doesn’t face problems like this. It doesn’t need additional information to return a list of resources.

Then what about words like Rust and Linux, which are high on the list of common searches, but not in the top 13 for Answers? It’s relatively easy to come up with specific questions about either of these—or, for that matter, about SQL, AWS, or React. SQL, AWS, and Linux are reasonably close to the top of the Answers word list. If we just concern ourselves with the order in which words appear, things start to fall into place: AWS (and cloud) follow learning; they are followed by Linux, followed by SQL. We’re not surprised that there are few questions about CISSP on Answers; it’s a certification exam, so users are more likely to want test prep material than to ask specific questions. Rust and React are still outliers, though; it’s easy to ask precise and specific questions about either of them. Rust is still unfamiliar to many of our users—could the explanation be that our customers want to learn Rust as a whole rather than ask specific questions that might only occur to someone who’s already learned the language? But if you accept that, React still remains an outlier. We may know the answers next year, at which time we’ll have a much longer track record with Answers.

The Coming Year

That wraps up last year. What will we see this year? We’ve given hints throughout this report. Let’s bring it all together.

AI dominated the news for 2024. It will continue to do so in 2025, despite some disillusionment. For the most part, those who are disillusioned aren’t the people making decisions about what products to build. While concern about jobs is understandable in a year that’s seen significant layoffs, we don’t believe that AI is “coming for your job.” However, we do believe that the future will belong to those who learn how to use AI effectively—and that AI will have a profound impact on every profession, not just IT and not just “knowledge workers.” Using AI effectively isn’t just about coming up with clever prompts so you can copy and paste an answer. If all you can do is prompt, copy, and paste, you’re about to become superfluous. You need to figure out how to work with AI to create something that’s better than what the AI could do by itself. Training employees to use AI effectively is one of the best things a company can do to prepare for an AI-driven future. Companies that don’t invest in training will inevitably fall behind.

In the coming year, will companies build AI applications on top of the giant foundation models like GPT-4, Claude, and Gemini? Or will they build on top of smaller open models, many of which are based on Meta’s Llama? And in the latter case, will they run the models on-premises (which includes the use of hosting and colocation providers), or will they rent use of these open AI models as a service from various providers? In the coming year, watch carefully what happens with the small open models. They already deliver performance almost as good as the foundation models and will undoubtedly be the basis for many AI applications. And we suspect that most companies will run these models in the cloud.

Security is the other significant growth area. Companies are waking up to the need to secure their data before their reputations—and their bottom lines—are compromised. Waking up has been a long, slow process that has sunk the careers of many CEOs and CIOs, but it’s happening. Our users are studying to gain security certifications. We see companies investing in governance and putting in company-wide policies to maintain security. In this respect, AI cuts both ways. It’s both a tool and a danger. It’s a tool because security professionals need to watch over huge streams of data, looking for the anomalies that signal an attack; it’s a tool because AI can digest sources of information about new threats and vulnerabilities; it’s a tool because AI can automate routine tasks like report generation. But it’s also a danger. AI-enabled applications increase an organization’s threat surface by introducing new vulnerabilities, like prompt injection, that we’re only now learning how to mitigate. We haven’t yet seen a high-profile attack against AI that compromised an organization’s ability to do business, but that will certainly happen eventually—maybe in 2025.

Whatever happens this year, AI will be at the center. Everyone will need to learn how to use AI effectively. AI will inevitably reshape all of our professions, but we don’t yet know how; we’re only starting to get glimpses. Is that exciting or terrifying? Both.

Footnotes

The definition of “open” and “open source” for AI is still controversial. Some open models don’t include access to weights, and many don’t include access to training data.

Beyond Skills: Unlocking the Full Potential of Data Scientists

Eric Colson — Tue, 29 Oct 2024 10:25:04 +0000

Modern organizations regard data as a strategic asset that drives efficiency, enhances decision making, and creates new value for customers. Across the organization—product management, marketing, operations, finance, and more—teams are overflowing with ideas on how data can elevate the business. To bring these ideas to life, companies are eagerly hiring data scientists for their technical skills (Python, statistics, machine learning, SQL, etc.).

Despite this enthusiasm, many companies are significantly underutilizing their data scientists. Organizations remain narrowly focused on employing data scientists to execute preexisting ideas, overlooking the broader value they bring. Beyond their skills, data scientists possess a unique perspective that allows them to come up with innovative business ideas of their own—ideas that are novel, strategic, or differentiating and are unlikely to come from anyone but a data scientist.

Misplaced Focus on Skills and Execution

Sadly, many companies behave in ways that suggest they are uninterested in the ideas of data scientists. Instead, they treat data scientists as a resource to be used for their skills alone. Functional teams provide requirements documents with fully specified plans: “Here’s how you are to build this new system for us. Thank you for your partnership.” No context is provided, and no input is sought—other than an estimate for delivery. Data scientists are further inundated with ad hoc requests for tactical analyses or operational dashboards.¹ The backlog of requests grows so large that the work queue is managed through Jira-style ticketing systems, which strip the requests of any business context (e.g., “get me the top products purchased by VIP customers”). One request begets another,² creating a Sisyphean endeavor that leaves no time for data scientists to think for themselves. And then there’s the myriad of opaque requests for data pulls: “Please get me this data so I can analyze it.” This is marginalizing—like asking Steph Curry to pass the ball so you can take the shot. It’s not a partnership; it’s a subordination that reduces data science to a mere support function, executing ideas from other teams. While executing tasks may produce some value, it won’t tap into the full potential of what data scientists truly have to offer.

It’s the Ideas

The untapped potential of data scientists lies not in their ability to execute requirements or requests but in their ideas for transforming a business. By “ideas” I mean new capabilities or strategies that can move the business in better or new directions—leading to increased³ revenue, profit, or customer retention while simultaneously providing a sustainable competitive advantage (i.e., capabilities or strategies that are difficult for competitors to replicate). These ideas often take the form of machine learning algorithms that can automate decisions within a production system.⁴ For example, a data scientist might develop an algorithm to better manage inventory by optimally balancing overage and underage costs. Or they might create a model that detects hidden customer preferences, enabling more effective personalization. If these sound like business ideas, that’s because they are—but they’re not likely to come from business teams. Ideas like these typically emerge from data scientists, whose unique cognitive repertoires and observations in the data make them well-suited to uncovering such opportunities.

Ideas That Leverage Unique Cognitive Repertoires

A cognitive repertoire is the range of tools, strategies, and approaches an individual can draw upon for thinking, problem-solving, or processing information (Page 2017). These repertoires are shaped by our backgrounds—education, experience, training, and so on. Members of a given functional team often have similar repertoires due to their shared backgrounds. For example, marketers are taught frameworks like SWOT analysis and ROAS, while finance professionals learn models such as ROIC and Black-Scholes.

Data scientists have a distinctive cognitive repertoire. While their academic backgrounds may vary—ranging from statistics to computer science to computational neuroscience—they typically share a quantitative tool kit. This includes frameworks for widely applicable problems, often with accessible names like the “newsvendor model,” the “traveling salesman problem,” the “birthday problem,” and many others. Their tool kit also includes knowledge of machine learning algorithms⁵ like neural networks, clustering, and principal components, which are used to find empirical solutions to complex problems. Finally, it includes heuristics such as big O notation, the central limit theorem, and significance thresholds. All of these constructs can be expressed in a common mathematical language, making them easily transferable across different domains, including business—perhaps especially business.

The repertoires of data scientists are particularly relevant to business innovation since, in many industries,⁶ the conditions for learning from data are nearly ideal in that they have high-frequency events, a clear objective function,⁷ and timely and unambiguous feedback. Retailers have millions of transactions that produce revenue. A streaming service sees millions of viewing events that signal customer interest. And so on—millions or billions of events with clear signals that are revealed quickly. These are the units of induction that form the basis for learning, especially when aided by machines. The data science repertoire, with its unique frameworks, machine learning algorithms, and heuristics, is remarkably geared for extracting knowledge from large volumes of event data.

Ideas are born when cognitive repertoires connect with business context. A data scientist, while attending a business meeting, will regularly experience pangs of inspiration. Her eyebrows raise from behind her laptop as an operations manager describes an inventory perishability problem, lobbing the phrase “We need to buy enough, but not too much.” “Newsvendor model,” the data scientist whispers to herself. A product manager asks, “How is this process going to scale as the number of products increases?” The data scientist involuntarily scribbles “O(N²)” on her notepad, which is big O notation to indicate that the process will scale superlinearly. And when a marketer brings up the topic of customer segmentation, bemoaning, “There are so many customer attributes. How do we know which ones are most important?,” the data scientist sends a text to cancel her evening plans. Instead, tonight she will eagerly try running principal components analysis on the customer data.⁸
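To make the first of those reflexes concrete, here is a minimal sketch of the newsvendor calculation. It assumes demand is roughly normally distributed, and the demand and cost figures are invented for illustration; none of them come from a real business. Under those assumptions, the order quantity that balances the two risks is the demand quantile at the critical ratio: the underage cost divided by the sum of the underage and overage costs.

```python
from statistics import NormalDist

def newsvendor_quantity(mean_demand, sd_demand, underage_cost, overage_cost):
    """Order quantity that balances stocking too little against stocking too much."""
    # Share of demand outcomes we want to be able to cover.
    critical_ratio = underage_cost / (underage_cost + overage_cost)
    # Assumption: demand is normally distributed.
    return NormalDist(mean_demand, sd_demand).inv_cdf(critical_ratio)

# Hypothetical perishable item: each lost sale costs 6, each unsold unit costs 2.
q = newsvendor_quantity(mean_demand=100, sd_demand=20, underage_cost=6, overage_cost=2)
print(round(q, 1))
```

Because a lost sale costs three times as much as an unsold unit in this example, the sketch orders above average demand (about 113.5 units against a mean of 100). If the two costs were equal, it would order exactly the mean.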

No one was asking for ideas. This was merely a tactical meeting with the goal of reviewing the state of the business. Yet the data scientist is practically goaded into ideating. “Oh, oh. I got this one,” she says to herself. Ideation can even be hard to suppress. Yet many companies unintentionally seem to suppress that creativity. In reality our data scientist probably wouldn’t have been invited to that meeting. Data scientists are not typically invited to operating meetings. Nor are they typically invited to ideation meetings, which are often limited to the business teams. Instead, the meeting group will assign the data scientist Jira tickets of tasks to execute. Without the context, the tasks will fail to inspire ideas. The cognitive repertoire of the data scientist goes unleveraged—a missed opportunity to be sure.

Ideas Born from Observation in the Data

Beyond their cognitive repertoires, data scientists bring another key advantage that makes their ideas uniquely valuable. Because they are so deeply immersed in the data, data scientists discover unforeseen patterns and insights that inspire novel business ideas. They are novel in the sense that no one would have thought of them—not product managers, executives, marketers—not even a data scientist for that matter. There are many ideas that cannot be conceived of but rather are revealed by observation in the data.

Company data repositories (data warehouses, data lakes, and the like) contain a primordial soup of insights lying fallow in the information. As they do their work, data scientists often stumble upon intriguing patterns—an odd-shaped distribution, an unintuitive relationship, and so forth. The surprise finding piques their curiosity, and they explore further.

Imagine a data scientist doing her work, executing on an ad hoc request. She is asked to compile a list of the top products purchased by a particular customer segment. To her surprise, the products bought by the various segments are hardly different at all. Most products are bought at about the same rate by all segments. Weird. The segments are based on profile descriptions that customers opted into, and for years the company had assumed them to be meaningful groupings useful for managing products. “There must be a better way to segment customers,” she thinks. She explores further, launching an informal, impromptu analysis. No one is asking her to do this, but she can’t help herself. Rather than relying on the labels customers use to describe themselves, she focuses on their actual behavior: what products they click on, view, like, or dislike. Through a combination of quantitative techniques—matrix factorization and principal component analysis—she comes up with a way to place customers into a multidimensional space. Clusters of customers adjacent to one another in this space form meaningful groupings that better reflect customer preferences. The approach also provides a way to place products into the same space, allowing for distance calculations between products and customers. This can be used to recommend products, plan inventory, target marketing campaigns, and many other business applications. All of this is inspired by the surprising observation that the tried-and-true customer segments did little to explain customer behavior. Solutions like this have to be driven by observation since, absent the data saying otherwise, no one would have thought to inquire about a better way to group customers.

As a side note, the principal component algorithm that the data scientist used belongs to a class of algorithms called “unsupervised learning,” which further exemplifies the concept of observation-driven insights. Unlike “supervised learning,” in which the user instructs the algorithm what to look for, an unsupervised learning algorithm lets the data describe how it is structured. It is evidence based; it quantifies and ranks each dimension, providing an objective measure of relative importance. The data does the talking. Too often we try to direct the data to yield to our human-conceived categorization schemes, which are familiar and convenient to us, evoking visceral and stereotypical archetypes. It’s satisfying and intuitive but often flimsy and fails to hold up in practice.
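A small sketch shows what "letting the data describe how it is structured" can look like. The interaction scores below are made up, and the code finds only the first principal component using plain power iteration so that it runs without any libraries; a real analysis would use a numerical library and far more data. No segment labels are supplied anywhere, yet the component separates the customers into two groups and reports how much of the variation it accounts for.

```python
# Hypothetical interaction scores: rows are customers, columns are products.
behavior = [
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
]

def center(rows):
    """Subtract each column's mean so components describe variation, not level."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n = len(rows)
    cols = list(zip(*rows))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols] for ci in cols]

def first_component(cov, steps=100):
    """Power iteration: apply cov repeatedly and renormalize to reach the top eigenvector."""
    # Assumption: this starting vector is not orthogonal to the top eigenvector.
    v = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(steps):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Variance along v (the eigenvalue), computed as v^T * cov * v.
    eigenvalue = sum(vi * sum(c * x for c, x in zip(row, v)) for vi, row in zip(v, cov))
    return v, eigenvalue

centered = center(behavior)
cov = covariance(centered)
direction, variance = first_component(cov)
share = variance / sum(cov[i][i] for i in range(len(cov)))
# Each customer's position along the first component.
scores = [sum(x * d for x, d in zip(row, direction)) for row in centered]

print(f"variance explained by first component: {share:.0%}")
print("customer scores:", [round(s, 2) for s in scores])
```

In this toy data, the first two customers land on one side of the component and the last two on the other, and the component accounts for most of the variation. That share is the kind of objective ranking of dimensions the paragraph above describes.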

Examples like this are not rare. When immersed in the data, it’s hard for data scientists not to come upon unexpected findings. And when they do, it’s even harder for them to resist further exploration—curiosity is a powerful motivator. Of course, the data scientist in our example exercised her cognitive repertoire to do the work, but the entire analysis was inspired by observation of the data. For the company, such distractions are a blessing, not a curse. I’ve seen this sort of undirected research lead to better inventory management practices, better pricing structures, new merchandising strategies, improved user experience designs, and many other capabilities—none of which were asked for but instead were discovered by observation in the data.

Isn’t discovering new insights the data scientist’s job? Yes—that’s exactly the point of this article. The problem arises when data scientists are valued only for their technical skills. Viewing them solely as a support team limits them to answering specific questions, preventing deeper exploration of insights in the data. The pressure to respond to immediate requests often causes them to overlook anomalies, unintuitive results, and other potential discoveries. If a data scientist were to suggest some exploratory research based on observations, the response is almost always, “No, just focus on the Jira queue.” Even if they spend their own time—nights and weekends—researching a data pattern that leads to a promising business idea, it may still face resistance simply because it wasn’t planned or on the roadmap. Roadmaps tend to be rigid, dismissing new opportunities, even valuable ones. In some organizations, data scientists may pay a price for exploring new ideas. Data scientists are often judged by how well they serve functional teams, responding to their requests and fulfilling short-term needs. There is little incentive to explore new ideas when doing so detracts from a performance review. In reality, data scientists frequently find new insights in spite of their jobs, not because of them.

Ideas That Are Different

These two things—their cognitive repertoires and observations from the data—make the ideas that come from data scientists uniquely valuable. This is not to suggest that their ideas are necessarily better than those from the business teams. Rather, their ideas are different from those of the business teams. And being different has its own set of benefits.

Having a seemingly good business idea doesn’t guarantee that the idea will have a positive impact. Evidence suggests that most ideas will fail. When properly measured for causality,⁹ the vast majority of business ideas either fail to show any impact at all or actually hurt metrics. (See some statistics here.) Given the poor success rates, innovative companies construct portfolios of ideas in the hopes that at least a few successes will allow them to reach their goals. Still savvier companies use experimentation¹⁰ (A/B testing) to try their ideas on small samples of customers, allowing them to assess the impact before deciding to roll them out more broadly.

This portfolio approach, combined with experimentation, benefits from both the quantity and diversity of ideas.¹¹ It’s similar to diversifying a portfolio of stocks. Increasing the number of ideas in the portfolio increases exposure to a positive outcome—an idea that makes a material positive impact on the company. Of course, as you add ideas, you also increase the risk of bad outcomes—ideas that do nothing or even have a negative impact. However, many ideas are reversible—the “two-way door” that Amazon’s Jeff Bezos speaks of (Haden 2018). Ideas that don’t produce the expected results can be pruned after being tested on a small sample of customers, greatly mitigating the impact, while successful ideas can be rolled out to all relevant customers, greatly amplifying the impact.

So, adding ideas to the portfolio increases exposure to upside without a lot of downside—the more, the better.¹² However, there is an assumption that the ideas are independent (uncorrelated). If all the ideas are similar, then they may all succeed or fail together. This is where diversity comes in. Ideas from different groups will leverage divergent cognitive repertoires and different sets of information. This makes them different and less likely to be correlated with each other, producing more varied outcomes. For stocks, the return on a diverse portfolio will be the average of the returns for the individual stocks. However, for ideas, since experimentation lets you mitigate the bad ones and amplify the good ones, the return of the portfolio can be closer to the return of the best idea (Page 2017).

In addition to building a portfolio of diverse ideas, a single idea can be significantly strengthened through collaboration between data scientists and business teams.¹³ When they work together, their combined repertoires fill in each other’s blind spots (Page 2017).¹⁴ By merging the unique expertise and insights from multiple teams, ideas become more robust, much like how diverse groups tend to excel in trivia competitions. However, organizations must ensure that true collaboration happens at the ideation stage rather than dividing responsibilities such that business teams focus solely on generating ideas and data scientists are relegated to execution.

Cultivating Ideas

Data scientists are much more than a skilled resource for executing existing ideas; they are a wellspring of novel, innovative thinking. Their ideas are uniquely valuable because (1) their cognitive repertoires are highly relevant to businesses with the right conditions for learning, (2) their observations in the data can lead to novel insights, and (3) their ideas differ from those of business teams, adding diversity to the company’s portfolio of ideas.

However, organizational pressures often prevent data scientists from fully contributing their ideas. Overwhelmed with skill-based tasks and deprived of business context, they are incentivized to merely fulfill the requests of their partners. This pattern exhausts the team’s capacity for execution while leaving their cognitive repertoires and insights largely untapped.

Here are some suggestions that organizations can follow to better leverage data scientists and shift their roles from mere executors to active contributors of ideas:

Give them context, not tasks. Providing data scientists with tasks or fully specified requirements documents will get them to do work, but it won’t elicit their ideas. Instead, give them context. If an opportunity is already identified, describe it broadly through open dialogue, allowing them to frame the problem and propose solutions. Invite data scientists to operational meetings where they can absorb context, which may inspire new ideas for opportunities that haven’t yet been considered.
Create slack for exploration. Companies often completely overwhelm data scientists with tasks. It may seem paradoxical, but keeping resources 100% utilized is very inefficient.¹⁵ Without time for exploration and unexpected learning, data science teams can’t reach their full potential. Protect some of their time for independent research and exploration, using tactics like Google’s 20% time or similar approaches.
Eliminate the task management queue. Task queues create a transactional, execution-focused relationship with the data science team. Priorities, if assigned top-down, should be given in the form of general, unframed opportunities that need real conversations to provide context, goals, scope, and organizational implications. Priorities might also emerge from within the data science team, requiring support from functional partners, with the data science team providing the necessary context. We don’t assign Jira tickets to product or marketing teams, and data science should be no different.
Hold data scientists accountable for real business impact. Measure data scientists by their impact on business outcomes, not just by how well they support other teams. This gives them the agency to prioritize high-impact ideas, regardless of the source. Additionally, tying performance to measurable business impact¹⁶ clarifies the opportunity cost of low-value ad hoc requests.¹⁷
Hire for adaptability and broad skill sets. Look for data scientists who thrive in ambiguous, evolving environments where clear roles and responsibilities may not always be defined. Prioritize candidates with a strong desire for business impact,¹⁸ who see their skills as tools to drive outcomes, and who excel at identifying new opportunities aligned with broad company goals. Hiring for diverse skill sets enables data scientists to build end-to-end systems, minimizing the need for handoffs and reducing coordination costs—especially critical during the early stages of innovation when iteration and learning are most important.¹⁹
Hire functional leaders with growth mindsets. In new environments, avoid leaders who rely too heavily on what worked in more mature settings. Instead, seek leaders who are passionate about learning and who value collaboration, leveraging diverse perspectives and information sources to fuel innovation.

These suggestions require an organization with the right culture and values. The culture needs to embrace experimentation to measure the impact of ideas and to recognize that many will fail. It needs to value learning as an explicit goal and understand that, for some industries, the vast majority of knowledge has yet to be discovered. It must be comfortable relinquishing the clarity of command-and-control in exchange for innovation. While this is easier to achieve in a startup, these suggestions can guide mature organizations toward evolving with experience and confidence. Shifting an organization’s focus from execution to learning is a challenging task, but the rewards can be immense or even crucial for survival. For most modern firms, success will depend on their ability to harness human potential for learning and ideation—not just execution (Edmondson 2012). The untapped potential of data scientists lies not in their ability to execute existing ideas but in the new and innovative ideas no one has yet imagined.

Footnotes

¹ To be sure, dashboards have value in providing visibility into business operations. However, dashboards are limited in their ability to provide actionable insights. Aggregated data is typically so full of confounders and systemic bias that it is rarely appropriate for decision making. The resources required to build and maintain dashboards need to be balanced against other initiatives the data science team could be doing that might produce more impact.
² It’s a well-known phenomenon that data-related inquiries tend to evoke more questions than they answer.
³ I used “increased” in place of “incremental” since the latter is associated with “small” or “marginal.” The impact from data science initiatives can be substantial. I use the term here to indicate the impact as an improvement—though without a fundamental change to the existing business model.
⁴ As opposed to data used for human consumption, such as short summaries or dashboards, which do have value in that they inform our human workers but are typically limited in direct actionability.
⁵ I resist referring to knowledge of the various algorithms as skills since I feel it’s more important to emphasize their conceptual appropriateness for a given situation versus the pragmatics of training or implementing any particular approach.
⁶ Industries such as ecommerce, social networks, and streaming content have favorable conditions for learning in comparison to fields like medicine, where the frequency of events is much lower and the time to feedback is much longer. Additionally, in many aspects of medicine, the feedback can be very ambiguous.
⁷ Typically revenue, profit, or user retention. However, it can be challenging for a company to identify a single objective function.
⁸ Voluntary tinkering is common among data scientists and is driven by curiosity, the desire for impact, the desire for experience, etc.
⁹ Admittedly, the data available on the success rates of business ideas is likely biased in that most of it comes from tech companies experimenting with online services. However, at least anecdotally, the low success rates seem to be consistent across other types of business functions, industries, and domains.
¹⁰ Not all ideas are conducive to experimentation due to unattainable sample size, inability to isolate experimentation arms, ethical concerns, or other factors.
¹¹ I purposely exclude the notion of “quality of idea” since, in my experience, I’ve seen little evidence that an organization can discern the “better” ideas within the pool of candidates.
¹² Often, the real cost of developing and trying an idea is the human resources—engineers, data scientists, PMs, designers, etc. These resources are fixed in the short term and act as a constraint to the number of ideas that can be tried in a given time period.
¹³ See Duke University professor Martin Ruef, who studied the coffee house model of innovation (the coffee house is an analogy for bringing diverse people together to chat). Diverse networks are 3x more innovative than linear networks (Ruef 2002).
¹⁴ The data scientists will appreciate the analogy to ensemble models, where errors from individual models can offset each other.
¹⁵ See The Goal, by Eliyahu M. Goldratt, which articulates this point in the context of supply chains and manufacturing lines. Maintaining resources at a level above the current needs enables the firm to take advantage of unexpected surges in demand, which more than pays for itself. The practice works for human resources as well.
¹⁶ Causal measurement via randomized controlled trials is ideal, to which algorithmic capabilities are very amenable.
¹⁷ Admittedly, the value of an ad hoc request is not always clear. But there should be a high bar to consume data science resources. A Jira ticket is far too easy to submit. If a topic is important enough, it will merit a meeting to convey context and opportunity.
¹⁸ If you are reading this and find yourself skeptical that your data scientist who spends his time dutifully responding to Jira tickets is capable of coming up with a good business idea, you are likely not wrong. Those comfortable taking tickets are probably not innovators or have been so inculcated to a support role that they have lost the will to innovate.
¹⁹ As the system matures, more specialized resources can be added to make the system more robust. This can create a scramble. However, by finding success first, we are more judicious with our precious development resources.

References

Page, Scott E. 2017. The Diversity Bonus. Princeton University Press.
Edmondson, Amy C. 2012. Teaming: How Organizations Learn, Innovate, and Compete in the Knowledge Economy. Jossey-Bass.
Haden, Jeff. 2018. “Amazon Founder Jeff Bezos: This Is How Successful People Make Such Smart Decisions.” Inc., December 3. https://www.inc.com/jeff-haden/amazon-founder-jeff-bezos-this-is-how-successful-people-make-such-smart-decisions.html.
Ruef, Martin. 2002. “Strong Ties, Weak Ties and Islands: Structural and Cultural Predictors of Organizational Innovation.” Industrial and Corporate Change 11 (3): 427–449. https://doi.org/10.1093/icc/11.3.427.

Universal API Access from Postgres and SQLite

Jon Udell — Tue, 27 Feb 2024 13:39:58 +0000

In “SQL: The Universal Solvent for REST APIs” we saw how Steampipe’s suite of open source plug-ins translates REST API calls directly into SQL tables. These plug-ins were, until recently, tightly bound to the open source engine and to the instance of Postgres that it launches and controls. That led members of the Steampipe community to ask: “Can we use the plug-ins in our own Postgres databases?” Now the answer is yes—and more—but let’s focus on Postgres first.

NOTE: Each Steampipe plugin ecosystem is now also a standalone foreign-data-wrapper extension for Postgres, a virtual-table extension for SQLite, and an export tool.

Using a Steampipe Plugin as a Standalone Postgres Foreign Data Wrapper (FDW)

Visit Steampipe downloads to find the installer for your OS, and run it to acquire the Postgres FDW distribution of a plugin—in this case, the GitHub plugin. It’s one of (currently) 140 plug-ins available on the Steampipe hub. Each plugin provides a set of tables that map API calls to database tables—in the case of the GitHub plugin, 55 such tables. Each table can appear in a FROM or JOIN clause; here’s a query that selects columns from the github_issue table, filtering on a repository and author.

select
  state,
  updated_at,
  title,
  url
from
  github_issue
where
  repository_full_name = 'turbot/steampipe'
  and author_login = 'judell'
order by
  updated_at desc

If you’re using Steampipe, you can install the GitHub plugin like this:

steampipe plugin install github

then run the query in the Steampipe CLI or in any Postgres client that can connect to Steampipe’s instance of Postgres.

But if you want to do the same thing in your own instance of Postgres, you can install the plugin in a different way.

$ sudo /bin/sh -c "$(
   curl -fsSL https://steampipe.io/install/postgres.sh)"
Enter the plugin name: github
Enter the version (latest): 

Discovered:
- PostgreSQL version:   14
- PostgreSQL location:  /usr/lib/postgresql/14
- Operating system:     Linux
- System architecture:  x86_64

Based on the above, steampipe_postgres_github.pg14.linux_amd64.tar.gz
will be downloaded, extracted and installed at: /usr/lib/postgresql/14

Proceed with installing Steampipe PostgreSQL FDW for version 14 at
 /usr/lib/postgresql/14?
- Press 'y' to continue with the current version.
- Press 'n' to customize your PostgreSQL installation directory 
and select a different version. (Y/n): 


Downloading steampipe_postgres_github.pg14.linux_amd64.tar.gz...
########################################################################################### 100.0%
steampipe_postgres_github.pg14.linux_amd64/
steampipe_postgres_github.pg14.linux_amd64/steampipe_postgres_github.so
steampipe_postgres_github.pg14.linux_amd64/steampipe_postgres_github.control
steampipe_postgres_github.pg14.linux_amd64/steampipe_postgres_github--1.0.sql
steampipe_postgres_github.pg14.linux_amd64/install.sh
steampipe_postgres_github.pg14.linux_amd64/README.md

Download and extraction completed.

Installing steampipe_postgres_github in /usr/lib/postgresql/14...

Successfully installed steampipe_postgres_github extension!

Files have been copied to:
- Library directory: /usr/lib/postgresql/14/lib
- Extension directory: /usr/share/postgresql/14/extension/

Now connect to your server as usual, using psql or another client, most typically as the postgres user. Then run these commands, which are typical for any Postgres foreign data wrapper. As with all Postgres extensions, you start like this:

CREATE EXTENSION steampipe_postgres_github;

To use a foreign data wrapper, you first create a server:

CREATE SERVER steampipe_github FOREIGN DATA WRAPPER
steampipe_postgres_github OPTIONS (config 'token="ghp_..."');

Use OPTIONS to configure the extension to use your GitHub access token. (Alternatively, the standard environment variables used to configure a Steampipe plugin—it’s just GITHUB_TOKEN in this case—will work if you set them before starting your instance of Postgres.)

The tables provided by the extension will live in a schema, so define one:

CREATE SCHEMA github;

Now import the schema defined by the foreign server into the local schema you just created:

IMPORT FOREIGN SCHEMA github FROM SERVER steampipe_github INTO github;

Now run a query!

The foreign tables provided by the extension live in the github schema, so by default you’ll refer to tables like github.github_my_repository. If you set search_path = 'github', though, the schema becomes optional and you can write queries using unqualified table names. Here’s a query we showed last time. It uses the github_search_repository table, which encapsulates the GitHub API for searching repositories.

Suppose you’re looking for repos related to PySpark. Here’s a query to find repos whose names match “pyspark” and report a few metrics to help you gauge activity and popularity.

select
  name_with_owner,
  updated_at,     -- how recently updated?
  stargazer_count -- how many people starred the repo?
from 
  github_search_repository 
where 
  query = 'pyspark in:name' 
order by
  stargazer_count desc
limit 10;
+---------------------------------------+------------+---------------+
|name_with_owner                        |updated_at  |stargazer_count|
+---------------------------------------+------------+---------------+
| AlexIoannides/pyspark-example-project | 2024-02-09 | 1324          |
| mahmoudparsian/pyspark-tutorial       | 2024-02-11 | 1077          |
| spark-examples/pyspark-examples       | 2024-02-11 | 1007          |
| palantir/pyspark-style-guide          | 2024-02-12 | 924           |
| pyspark-ai/pyspark-ai                 | 2024-02-12 | 791           |
| lyhue1991/eat_pyspark_in_10_days      | 2024-02-01 | 719           |
| UrbanInstitute/pyspark-tutorials      | 2024-01-21 | 400           |
| krishnaik06/Pyspark-With-Python       | 2024-02-11 | 400           |
| ekampf/PySpark-Boilerplate            | 2024-02-11 | 388           |
| commoncrawl/cc-pyspark                | 2024-02-12 | 361           |
+---------------------------------------+------------+---------------+

If you have a lot of repos, the first run of that query will take a few seconds. The second run will return results instantly, though, because the extension includes a powerful and sophisticated cache.

And that’s all there is to it! Every Steampipe plugin is now also a foreign data wrapper that works exactly like this one. You can load multiple extensions in order to join across APIs. Of course, you can join any of these API-sourced foreign tables with your own Postgres tables. And to save the results of any query, you can prepend “create table NAME as” or “create materialized view NAME as” to a query to persist results as a table or view.

Using a Steampipe Plugin as a SQLite Extension That Provides Virtual Tables

Visit Steampipe downloads to find the installer for your OS and run it to acquire the SQLite distribution of the same plugin.

$ sudo /bin/sh -c "$(curl -fsSL https://steampipe.io/install/sqlite.sh)"
Enter the plugin name: github
Enter version (latest): 
Enter location (current directory): 

Downloading steampipe_sqlite_github.linux_amd64.tar.gz...
############################################################
################ 100.0%
steampipe_sqlite_github.so

steampipe_sqlite_github.linux_amd64.tar.gz downloaded and 
extracted successfully at /home/jon/steampipe-sqlite.

Here’s the setup, and you can place this code in ~/.sqliterc if you want to run it every time you start sqlite.

.load /home/jon/steampipe-sqlite/steampipe_sqlite_github.so

select steampipe_configure_github('
  token="ghp_..."
');

Now you can run the same query as above. Here, too, the results are cached, so a second run of the query will be instant.

What about the differences between Postgres-flavored and SQLite-flavored SQL? The Steampipe hub is your friend! For example, here are Postgres and SQLite variants of a query that accesses a field inside a JSON column in order to tabulate the languages associated with your gists.

Postgres

SQLite

The github_my_gist table reports details about gists that belong to the GitHub user who is authenticated to Steampipe. The language associated with each gist lives in a JSONB column called files, which contains a list of objects like this.

{
   "size": 24541,
   "type": "text/markdown",
   "raw_url": "https://gist.githubusercontent.com/judell/49d66ca2a5d2a3b
   "filename": "steampipe-readme-update.md",
   "language": "Markdown"
}

The functions needed to project that list as rows differ: in Postgres you use jsonb_array_elements and in SQLite it’s json_each.
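To make the SQLite side concrete, here is a minimal sketch in Python using the standard sqlite3 module. It does not load the Steampipe extension. Instead it assumes a hypothetical local stand-in table named github_my_gist whose files column holds a JSON list shaped like the object above, and it assumes your SQLite build includes the JSON functions (most recent builds do). The sample rows are invented for illustration.

```python
import json
import sqlite3

# Hypothetical stand-in for the github_my_gist table; the rows are made up.
SAMPLE_GISTS = [
    ("gist-1", [{"filename": "notes.md", "language": "Markdown"}]),
    ("gist-2", [{"filename": "query.sql", "language": "SQL"},
                {"filename": "readme.md", "language": "Markdown"}]),
]


def tabulate_languages(gists):
    """Count gist files per language by expanding the JSON list with json_each."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table github_my_gist (id text, files text)")
    conn.executemany(
        "insert into github_my_gist values (?, ?)",
        [(gist_id, json.dumps(files)) for gist_id, files in gists],
    )
    rows = conn.execute(
        """
        select
          json_extract(f.value, '$.language') as language,
          count(*) as n
        from
          github_my_gist g,
          json_each(g.files) as f
        group by
          language
        order by
          n desc, language
        """
    ).fetchall()
    conn.close()
    return rows


language_counts = tabulate_languages(SAMPLE_GISTS)
print(language_counts)
```

In this sketch json_each turns each element of the files list into its own row, and json_extract pulls the language field out of that element. In Postgres, jsonb_array_elements plays the role that json_each plays here.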

As with Postgres extensions, you can load multiple SQLite extensions in order to join across APIs. You can join any of these API-sourced virtual tables with your own SQLite tables. And you can prepend create table NAME as to a query to persist results as a table.
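As a rough illustration of persisting results, the following Python sketch uses the standard sqlite3 module with an ordinary in-memory table standing in for an API-backed one. The table names search_results and popular_repos, and all of the rows, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for an API-backed table; the rows are invented.
conn.execute(
    "create table search_results (name_with_owner text, stargazer_count integer)"
)
conn.executemany(
    "insert into search_results values (?, ?)",
    [("example/repo-a", 120), ("example/repo-b", 45), ("example/repo-c", 300)],
)

query = """
select name_with_owner, stargazer_count
from search_results
where stargazer_count >= 100
"""

# Prepending "create table NAME as" stores a snapshot of the query's results.
conn.execute("create table popular_repos as " + query)

snapshot = conn.execute(
    "select * from popular_repos order by stargazer_count desc"
).fetchall()
print(snapshot)
```

The new table is a point-in-time copy: later changes to the source data do not appear in it until you rebuild it.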

Using a Steampipe Plugin as a Standalone Export Tool

Visit Steampipe downloads to find the installer for your OS, and run it to acquire the export distribution of a plugin—again, we’ll illustrate using the GitHub plugin.

$ sudo /bin/sh -c "$(curl -fsSL https://steampipe.io/install/export.sh)"
Enter the plugin name: github
Enter the version (latest): 
Enter location (/usr/local/bin): 
Created temporary directory at /tmp/tmp.48QsUo6CLF.

Downloading steampipe_export_github.linux_amd64.tar.gz...
##########################################################
#################### 100.0%
Deflating downloaded archive
steampipe_export_github
Installing
Applying necessary permissions
Removing downloaded archive
steampipe_export_github was installed successfully to
/usr/local/bin
$ steampipe_export_github -h
Export data using the github plugin.

Find detailed usage information including table names, 
column names, and examples at the Steampipe Hub:
https://hub.steampipe.io/plugins/turbot/github

Usage:
  steampipe_export_github TABLE_NAME [flags]

Flags:
      --config string       Config file data
  -h, --help                help for steampipe_export_github
      --limit int           Limit data
      --output string       Output format: csv, json or jsonl (default "csv")
      --select strings      Column data to display
      --where stringArray   where clause data

There’s no SQL engine in the picture here; this tool is purely an exporter. To export all your gists to a JSON file:

steampipe_export_github github_my_gist --output json > gists.json

To select only some columns and export to a CSV file:

steampipe_export_github github_my_gist --output csv --select \
  "description,created_at,html_url" > gists.csv

You can use --limit to limit the rows returned and --where to filter them, but mostly you’ll use this tool to quickly and easily grab data that you’ll massage elsewhere, for example, in a spreadsheet.
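If a spreadsheet isn’t your tool of choice, a few lines of Python can do the massaging. This is only a sketch: the sample text below is invented, it assumes the export has the three columns selected above, and it assumes created_at is an ISO-style timestamp that begins with the year. In practice you would read gists.csv rather than an inline string.

```python
import csv
import io

# Invented sample shaped like the output of the --select example above.
SAMPLE_EXPORT = """description,created_at,html_url
First gist,2023-01-05T10:00:00Z,https://gist.github.com/example/1
Second gist,2024-02-01T09:30:00Z,https://gist.github.com/example/2
"""


def gists_created_in(year, csv_text):
    """Return descriptions of gists whose created_at starts with the given year."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["description"]
        for row in reader
        if row["created_at"].startswith(str(year))
    ]


recent = gists_created_in(2024, SAMPLE_EXPORT)
print(recent)
```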

Tap into the Steampipe Plugin Ecosystem

Steampipe plug-ins aren’t just raw interfaces to underlying APIs. They use tables to model those APIs in useful ways. For example, the github_my_repository table exemplifies a design pattern that applies consistently across the suite of plug-ins. From the GitHub plugin’s documentation:

You can own repositories individually, or you can share ownership of repositories with other people in an organization. The github_my_repository table will list repos that you own, that you collaborate on, or that belong to your organizations. To query ANY repository, including public repos, use the github_repository table.

Other plug-ins follow the same pattern. For example, the Microsoft 365 plugin provides both microsoft_my_mail_message and microsoft_mail_message, and the Google Workspace plugin provides googleworkspace_my_gmail_message and googleworkspace_gmail_message. Where possible, plug-ins consolidate views of resources from the perspective of an authenticated user.

While plug-ins typically provide tables with fixed schemas, that’s not always the case. Dynamic schemas, implemented by the Airtable, CSV, Kubernetes, and Salesforce plug-ins (among others) are another key pattern. Here’s a CSV example using a standalone Postgres FDW.

IMPORT FOREIGN SCHEMA csv FROM SERVER steampipe_csv INTO csv 
 OPTIONS(config 'paths=["/home/jon/csv"]');

Now all the .csv files in /home/jon/csv will automagically be Postgres foreign tables. Suppose you keep track of valid owners of EC2 instances in a file called ec2_owner_tags.csv. Here’s a query against the corresponding table.

select * from csv.ec2_owner_tags;
     owner      |            _ctx
----------------+----------------------------
 Pam Beesly     | {"connection_name": "csv"}
 Dwight Schrute | {"connection_name": "csv"}

You could join that table with the AWS plugin’s aws_ec2_instance table to report owner tags on EC2 instances that are (or are not) listed in the CSV file.

select 
    ec2.owner,
    case 
        when csv.owner is null then 'false'
        else 'true'
    end as is_listed
from 
    (select distinct tags ->> 'owner' as owner 
     from aws.aws_ec2_instance) ec2
left join 
    csv.ec2_owner_tags csv on ec2.owner = csv.owner;
     owner      | is_listed
----------------+-----------
 Dwight Schrute | true
 Michael Scott  | false
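To see why the join produces exactly those rows, here is a small reproduction in Python with the standard sqlite3 module. It uses ordinary in-memory tables instead of the CSV and AWS plugins, the instance IDs and tags are invented, and json_extract stands in for the Postgres ->> operator, so treat it as a sketch of the join logic only.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table ec2_owner_tags (owner text)")
conn.execute("create table aws_ec2_instance (instance_id text, tags text)")

# The approved owners, as in the CSV file above.
conn.executemany(
    "insert into ec2_owner_tags values (?)",
    [("Pam Beesly",), ("Dwight Schrute",)],
)
# Invented instances; two share an owner to show what "distinct" is for.
conn.executemany(
    "insert into aws_ec2_instance values (?, ?)",
    [
        ("i-001", json.dumps({"owner": "Dwight Schrute"})),
        ("i-002", json.dumps({"owner": "Michael Scott"})),
        ("i-003", json.dumps({"owner": "Dwight Schrute"})),
    ],
)

report = conn.execute(
    """
    select
      ec2.owner,
      case when listed.owner is null then 'false' else 'true' end as is_listed
    from
      (select distinct json_extract(tags, '$.owner') as owner
       from aws_ec2_instance) ec2
    left join
      ec2_owner_tags listed on ec2.owner = listed.owner
    order by
      ec2.owner
    """
).fetchall()
print(report)
```

Because the left side of the join is the set of owner tags found on instances, Pam Beesly does not appear in the result: she is listed, but no instance carries her tag.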

Across the suite of plug-ins there are more than 2,300 predefined fixed-schema tables that you can use in these ways, plus an unlimited number of dynamic tables. And new plug-ins are constantly being added by Turbot and by Steampipe’s open source community. You can tap into this ecosystem using Steampipe or Turbot Pipes, from your own Postgres or SQLite database, or directly from the command line.

Technology Trends for 2024

Mike Loukides — Thu, 25 Jan 2024 11:04:43 +0000

This has been a strange year. While we like to talk about how fast technology moves, internet time, and all that, in reality the last major new idea in software architecture was microservices, which dates to roughly 2015. Before that, cloud computing itself took off in roughly 2010 (AWS was founded in 2006); and Agile goes back to 2000 (the Agile Manifesto dates back to 2001, Extreme Programming to 1999). The web is over 30 years old; the Netscape browser appeared in 1994, and it wasn’t the first. We think the industry has been in constant upheaval, but there have been relatively few disruptions: one every five years, if that.

2023 was one of those rare disruptive years. ChatGPT changed the industry, if not the world. We’re skeptical about things like job displacement, at least in technology. But AI is going to bring changes to almost every aspect of the software industry. What will those changes be? We don’t know yet; we’re still at the beginning of the story. In this report about how people are using O’Reilly’s learning platform, we’ll see how patterns are beginning to shift.

Just a few notes on methodology: This report is based on O’Reilly’s internal “Units Viewed” metric. Units Viewed measures the actual usage of content on our platform. The data used in this report covers January through November in 2022 and 2023. Each graph is scaled so that the topic with the greatest usage is 1. Therefore, the graphs can’t be compared directly to each other.
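The scaling and the year-over-year figures can be sketched in a few lines of Python. The numbers and topic names below are invented, and the percent-change formula is our assumption about how such a figure is normally computed, not a description of the actual analysis code.

```python
def scale_to_max(units_by_topic):
    """Scale usage so that the most heavily used topic in a graph equals 1."""
    peak = max(units_by_topic.values())
    return {topic: units / peak for topic, units in units_by_topic.items()}


def percent_change(previous, current):
    """Year-over-year change, expressed as a percentage of the earlier value."""
    return (current - previous) / previous * 100.0


# Invented Units Viewed figures for three hypothetical topics.
units_2023 = {"topic_a": 500.0, "topic_b": 250.0, "topic_c": 125.0}

scaled = scale_to_max(units_2023)
change = percent_change(200.0, 250.0)
print(scaled, change)
```

Since each graph is divided by its own peak, a value of 0.5 in one graph and 0.5 in another can represent very different amounts of usage.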

Remember that these “units” are “viewed” by our users, who are largely professional software developers and programmers. They aren’t necessarily following the latest trends. They’re solving real-world problems for their employers. And they’re picking up the skills they need to advance in their current positions or to get new ones. We don’t want to discount those who use our platform to get up to speed on the latest hot technology: that’s how the industry moves forward. But to understand usage patterns, it’s important to realize that every company has its own technology stacks, and that those stacks change slowly. Companies aren’t going to throw out 20 years’ investment in PHP so they can adopt the latest popular React framework, which will probably be displaced by another popular framework next year.

Software Development

Most of the topics that fall under software development declined in 2023. What does this mean? Programmers are still writing software; our lives are increasingly mediated by software, and that isn’t going to change.

Software developers are responsible for designing and building bigger and more complex projects than ever. That’s one trend that won’t change: complexity is always “up and to the right.” Generative AI is the wild card: Will it help developers to manage complexity? Or will it add complexity all its own? It’s tempting to look at AI as a quick fix. Who wants to learn about coding practices when you’re letting GitHub Copilot write your code for you? Who wants to learn about design patterns or software architecture when some AI application may eventually do your high-level design? AI is writing low-level code now; as many as 92% of software developers are using it. Whether it will be able to do high-level design is an open question—but as always, that question has two sides: “Will AI do our design work?” is less interesting than “How will AI change the things we want to design?” And the real question that will change our industry is “How do we design systems in which generative AI and humans collaborate effectively?”

Figure 1. Software architecture

Regardless of the answers to these questions, humans will need to understand and specify what needs to be designed. Our data shows that most topics in software architecture and design are down year-over-year. But there are exceptions. While software architecture is down 3.9% (a relatively small decline), enterprise architecture is up 8.9%. Domain-driven design is particularly useful for understanding the behavior of complex enterprise systems; it’s down, but only 2.0%. Use of content about event-driven architecture is relatively small, but it’s up 40%. That change is important because event-driven architecture is a tool for designing large systems that have to ingest data from many different streams in real time. Functional programming, which many developers see as a design paradigm that will help solve the problems of distributed systems, is up 9.8%. So the software development world is changing. It’s shifting toward distributed systems that manage large flows of data in real time. Use of content on topics relevant to that shift is holding its own or growing.

Microservices saw a 20% drop. Many developers expressed frustration with microservices during the year and argued for a return to monoliths. That accounts for the sharp decline—and it’s fair to say that many organizations are paying the price for moving to microservices because it was “the thing to do,” not because they needed the scale or flexibility that microservices can offer. From the start, microservice proponents have argued that the best way to develop microservices is to start with a monolith, then break the monolith into services as it becomes necessary. If implemented poorly, microservices deliver neither scale nor flexibility. Microservices aren’t ideal for new greenfield projects, unless you’re absolutely sure that you need them from the start—and even then, you should think twice. It’s definitely not a technology to implement just to follow the latest fad.

Software developers run hot and cold on design patterns, which declined 16%. Why? It probably depends on the wind or the phase of the moon. Content usage about design patterns increased 13% from 2021 to 2022, so this year’s decline just undoes last year’s gain. It’s possible that understanding patterns seems less important when AI is writing a lot of the code for you. It’s also possible that design patterns seem less relevant when code is already largely written; most programmers maintain existing applications rather than develop new greenfield apps, and few texts about design patterns discuss the patterns that are embedded in legacy applications. But both ways of thinking miss the point. Design patterns are common solutions to common problems that have been observed in practice. Understanding design patterns keeps you from reinventing wheels. Frameworks like React and Spring are important because they implement design patterns. Legacy applications won’t be improved by refactoring existing code just to use some pattern, but design patterns are useful for extending existing software and making it more flexible. And, of course, design patterns are used in legacy code—even code that was written before the term was coined! Patterns are discovered, not “invented”; again, they’re common solutions to problems programmers have been solving since the beginning of programming.

At the same time, whenever there’s a surge of interest in design patterns, there’s a corresponding surge in pattern abuse: managers asking developers how many patterns they used (as if pattern count were a metric for good code), developers implementing FactoryFactoryFactory Factories, and the like. What goes around comes around, and the abuse of design patterns is part of a feedback loop that regulates the use of design patterns.

Programming and Programming Languages

Most of the programming languages we track showed declines in content usage. Before discussing specifics, though, we need to look at general trends. If 92% of programmers are using generative AI to write code and answer questions, then we’d certainly expect a drop in content use. That may or may not be advisable for career development, but it’s a reality that businesses built on training and learning have to acknowledge. But that isn’t the whole story either—and the bigger story leaves us with more questions than answers.

Rachel Stephens provides two fascinating pieces of the puzzle in a recent article on the RedMonk blog, but those pieces don’t fit together exactly. First, she notes the decline in questions asked on Stack Overflow and states (reasonably) that asking a nonjudgmental AI assistant might be a preferable way for beginners to get their questions answered. We agree; we at O’Reilly have built O’Reilly Answers to provide that kind of assistance (and are in the process of a major upgrade that will make it even more useful). But Stack Overflow shows a broad peak in questions from 2014 to 2017, with a sharp decline afterward; the number of questions in 2023 is barely 50% of the peak, and the 20% decline from the January 2023 report to the July report is only somewhat sharper than the previous drops. And there was no generative AI, no ChatGPT, back in 2017 when the decline began. Did generative AI play a role? It would be foolish to say that it didn’t, but it can’t be the whole story.

Stephens points to another anomaly: GitHub pull requests declined roughly 25% from the second half of 2022 to the first half of 2023. Why? Stephens guesses that there was increased GitHub activity during the pandemic and that activity has returned to normal now that we’ve (incorrectly) decided the pandemic is over. Our own theory is that it’s a reaction to GPT models leaking proprietary code and abusing open source licenses; that could cause programmers to be wary of public code repositories. But those are only guesses. This change is apparently not an error in the data. It might be a one-time anomaly, but no one really knows the cause. Something drove down programmer activity on GitHub, and that’s inevitably a part of the background to this year’s data.

So, what does O’Reilly’s data say? As it has been for many years, Python is the most widely used programming language on our platform. This year, we didn’t see an increase; we saw a very small (0.14%) decline. That’s noise; we won’t insult your intelligence by claiming that “flat in a down market” is really a gain. It’s certainly fair to ask whether a language as popular as Python has gathered all the market share that it will get. When you’re at the top of the adoption curve, it’s difficult to go any higher and much easier to drop back. There are always new languages ready to take some of Python’s market share. The most significant change in the Python ecosystem is Microsoft’s integration of Python into Excel spreadsheets, but it’s too early to expect that to have had an effect.

Use of content about Java declined 14%, a significant drop but not out of line with the drop in GitHub activity. Like Python, Java is a mature language and may have nowhere to go but down. It has never been “well loved”; when Java was first announced, people walked out of the doors of the conference room claiming that Java was dead before you could even download the beta. (I was there.) Is it time to dance on Java’s grave? That dance has been going on since 1995, and it hasn’t been right yet.

Figure 2. Programming languages

JavaScript also declined by 3.9%. It’s a small decline and probably not meaningful. TypeScript, a version of JavaScript that adds static typing and type annotations, gained 5.6%. It’s tempting to say that these cancel each other out, but that’s not correct. Usage of TypeScript content is roughly one-tenth the usage of JavaScript content. But it is correct to say that interest in type systems is growing among web developers. It’s also true that an increasing number of junior developers use JavaScript only through a framework like React or Vue. Boot camps and other crash programs often train students in “React,” with little attention on the bigger picture. Developers trained in programs like these may be aware of JavaScript but may not think of themselves as JavaScript developers, and may not be looking to learn more about the language outside of a narrow, framework-defined context.
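A quick back-of-the-envelope calculation shows why the two percentages don’t cancel. The unit values below are hypothetical: they simply encode the roughly ten-to-one ratio described above, and they treat that ratio as the starting point for both changes, which is an approximation.

```python
def absolute_change(baseline_units, percent):
    """Convert a percentage change into a change in usage units."""
    return baseline_units * percent / 100.0


# Hypothetical baselines reflecting a roughly 10:1 usage ratio.
javascript_units = 10.0
typescript_units = 1.0

javascript_change = absolute_change(javascript_units, -3.9)
typescript_change = absolute_change(typescript_units, 5.6)
net_change = javascript_change + typescript_change
print(round(net_change, 3))
```

Under these assumptions the JavaScript decline is several times larger in absolute terms than the TypeScript gain, so combined usage still falls.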

We see growth in C++ (10%), which is surprising for an old, well-established language. (C++ first appeared in 1985.) At this point in C++’s history, we’d expect it to be a headache for people maintaining legacy code, not a language for starting new projects. Why is it growing? While C++ has long been an important language for game development, there are signs that it’s breaking out into other areas. C++ is an ideal language for embedded systems, which often require software that runs directly on the processor (for example, the software that runs in a smart lightbulb or in the braking system of any modern car). You aren’t going to use Python, Java, or JavaScript for those applications. C++ is also an excellent language for number crunching (Python’s numeric libraries are written in C++), which is increasingly important as artificial intelligence goes mainstream. It has also become the new “must have” language on résumés: knowing C++ proves that you’re tough, that you’re a “serious” programmer. Job anxiety exists—whether or not it’s merited is a different question—and in an environment where programmers are nervous about keeping their current jobs or looking forward to finding a new one, knowing a difficult but widely used language can only be an asset.

Use of content about Rust also increased from 2022 to 2023 (7.8%). Rust is a relatively young language that stresses memory safety and performance. While Rust is considered difficult to learn, the idea that memory safety is baked in makes it an important alternative to languages like C++. Bugs in memory management are a significant source of vulnerabilities, as noted in NIST’s page on “Safer Languages,” and Rust does a good job of enforcing safe memory usage. It’s now used in operating systems (Linux kernel components), tool development, and even enterprise software.

We also saw 9.8% growth in content about functional programming. We didn’t see gains for any of the historical functional programming languages (Haskell, Erlang, Lisp, and Elixir) though; most saw steep declines. In the past decade, most programming languages have added functional features. Newer languages like Rust and Go have had them from the start. And Java has gradually added features like closures in a series of updates. Now programmers can be as functional as they want to be without switching to a new language.

Finally, there are some programming languages that we don’t yet track but that we’re watching with interest. Zig is a simple imperative language that’s designed to be memory safe, like Rust, but relatively easy to learn. Mojo is a superset of Python that’s compiled, not interpreted. It’s designed for high performance, especially for numerical operations. Mojo’s goal is to facilitate AI programming in a single language rather than a combination of Python and some other language (typically C++) that’s used for performance-critical numerical code. Where are these languages going? It will be some years before they reach the level of Rust or Go, but they’re off to a good start.

So what does all this tell us about training and skill development? It’s easy to think that, with Copilot and other tools to answer all your questions, you don’t need to put as much effort into learning new technologies. We all ask questions on Google or Stack Overflow, and now we have other places to get answers. Necessary as that is, the idea that asking questions can replace training is naive. Unlike many who are observing the influence of generative AI on programming, we believe that it will increase the gap between entry-level skills and senior developer skills. Being a senior developer—being a senior anything—requires a kind of fluency that you can’t get just from asking questions. I may never be a fluent user of Python’s pandas library (which I used extensively to write this report); I asked lots of questions, and that has undoubtedly saved me time. But what happens when I need to solve the next problem? The kind of fluency that you need to look at a problem and understand how to solve it doesn’t come from asking simple “How do I do this?” questions. Nor does it preclude asking lots of “I forgot how this function works” questions. That’s why we’ve built O’Reilly Answers, an AI-driven service that finds solutions to questions using content from our platform. But expertise does require developing the intellectual muscle that comes from grappling with problems and solving them yourself rather than letting something else solve them for you. (And that includes forcing yourself to remember all the messy syntax details.) People who think generative AI is a shortcut to expertise (and the job title and salary that expertise merits) are shortchanging themselves.

Artificial Intelligence

In AI, there’s one story and only one story, and that’s the GPT family of models. Usage of content on these models exploded 3,600% in the past year. That explosion is tied to the appearance of ChatGPT in November 2022. But don’t make the mistake of thinking that ChatGPT came out of nowhere. GPT-3 created a big splash when it was released in 2020 (complete with a clumsy web-based interface). GPT-2 appeared in 2019, and the original unnumbered GPT was even earlier. The real innovation in ChatGPT wasn’t the technology itself (though the models behind it represent a significant breakthrough in AI performance); it was packaging the model as a chatbot. That doesn’t mean that the GPT explosion wasn’t real. While our analysis of search trends shows that interest in ChatGPT has peaked among our platform’s users, interest in natural language processing (NLP) showed a 195% increase—and from a much higher starting point.¹ That makes sense, given the more technical nature of our audience. Software developers will be building on top of the APIs for GPT and other language models and are likely less interested in ChatGPT, the web-based chat service. Related topics generative models (900%) and Transformers (325%) also showed huge gains. Prompt engineering, which didn’t exist in 2022, became a significant topic, with roughly the same usage as Transformers. As far as total use, NLP is almost twice GPT. However you want to read the data, this is AI’s big year, largely due to the GPT models and the idea of generative AI.

Figure 3. Artificial intelligence

But don’t assume that the explosion of interest in generative AI meant that other aspects of AI were standing still. Deep learning, the creation and application of neural networks with many layers, is fundamental to every aspect of modern AI. Usage of deep learning content grew 19% in the past year. Reinforcement learning, in which models are trained by giving “rewards” for solving problems, grew 15%. Those gains only look small in comparison to the triple- and quadruple-digit gains we’re seeing in natural language processing. PyTorch, the Python library that has come to dominate programming in machine learning and AI, grew 25%. In recent years, interest in PyTorch has been growing at the expense of TensorFlow, but TensorFlow showed a small gain (1.4%), reversing (or at least pausing) its decline. Interest in two older libraries, scikit-learn and Keras, declined: 25% for scikit-learn and 4.8% for Keras. Keras has largely been subsumed by TensorFlow, while scikit-learn hasn’t yet incorporated the capabilities that would make it a good platform for building generative AI. (An attempt to implement Transformers in scikit-learn appears to be underway at Hugging Face.)

We’ve long said that operations is the elephant in the room for machine learning and artificial intelligence. Building models and developing applications is challenging and fun, but no technology can mature if IT teams can’t deploy, monitor, and manage it. Interest in operations for machine learning (MLOps) grew 14% over the past year. This is solid, substantial growth that only looks small in comparison with topics like generative AI. Again, we’re still in the early stages—generative AI and large language models are only starting to reach production. If anything, this increase probably reflects older applications of AI. There’s a growing ecosystem of startups building tools for deploying and monitoring language models, which are fundamentally different from traditional applications. As companies deploy the applications they’ve been building, MLOps will continue to see solid growth. (More on MLOps when we discuss operations below.)

LangChain is a framework for building generative AI applications around groups of models and databases. It’s often used to implement the retrieval-augmented generation (RAG) pattern, where a user’s prompt is used to look up relevant items in a vector database; those items are then combined with the prompt, generating a new prompt that is sent to the language model. There isn’t much content about LangChain available yet, and it didn’t exist in 2022, but it’s clearly going to become a foundational technology. Likewise, vector databases aren’t yet in our data. We expect that to change next year. They are rather specialized, so we expect usage to be relatively small, unlike products like MySQL—but they will be very important.
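The shape of the RAG pattern is easy to sketch without any framework. The toy Python example below is not LangChain’s API, and it swaps the vector database for a simple word-overlap score; the documents, function names, and prompt template are all hypothetical. It stops at building the new prompt, which is the point where a real application would call a language model.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(prompt, documents, top_k=2):
    """Rank documents by how many words they share with the prompt.

    A real system would compare embeddings stored in a vector database.
    """
    prompt_words = tokenize(prompt)
    scored = [(len(prompt_words & tokenize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_augmented_prompt(prompt, documents, top_k=2):
    """Combine the retrieved items with the user's prompt to form a new prompt."""
    context = retrieve(prompt, documents, top_k)
    context_block = "\n".join(f"- {doc}" for doc in context)
    return (
        "Use the following context to answer.\n"
        f"{context_block}\n\n"
        f"Question: {prompt}"
    )


# Invented documents standing in for the contents of a knowledge base.
DOCUMENTS = [
    "Kafka is a platform for streaming data.",
    "Power BI is a tool for business analytics.",
    "Spark is an engine for large-scale data processing.",
]

augmented = build_augmented_prompt(
    "What is Kafka used for in streaming?", DOCUMENTS, top_k=1
)
print(augmented)
```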

AI wasn’t dominated entirely by the work of OpenAI; Meta’s LLaMA and Llama 2 also attracted a lot of attention. The source code for LLaMA was open source, and its weights (parameters) were easily available to researchers. Those weights quickly leaked from “researchers” to the general public, where they jump-started the creation of smaller open source models. These models are much smaller than behemoths like GPT-4. Many of them can run on laptops, and they’re proving ideal for smaller companies that don’t want to rely on Microsoft, OpenAI, or Google to provide AI services. (If you want to run an open source language model on your laptop, try llamafile.) While huge “foundation models” like the GPT family won’t disappear, in the long run open source models like Alpaca and Mistral may prove to be more important to software developers.

It’s easy to think that generative AI is just about software development. It isn’t; its influence extends to just about every field. Our ChatGPT: Possibilities and Pitfalls Superstream was the most widely attended event we’ve ever run. There were over 28,000 registrations, with attendees and sponsors from industries as diverse as pharmaceuticals, logistics, and manufacturing. Attendees included small business owners, sales and marketing personnel, and C-suite executives, along with many programmers and engineers from different disciplines. We’ve also been running courses focused on specific industries: Generative AI for Finance had over 2,000 registrations, and Generative AI for Government over 1,000. And more than 1,000 people signed up for our Generative AI for Healthcare event.

Data

In previous years, we would have told the story of AI as part of the story of data. That’s still correct; with its heavy emphasis on mathematics and statistics, AI is a natural outgrowth of data science. But this year, AI has become the superstar that gets top billing, while data is a supporting actor.

That doesn’t mean that data is unimportant. Far from it. Every company uses data: for planning, for making projections, for analyzing what’s happening within the business and the markets they serve. So it’s not surprising that the second biggest topic in data is Microsoft Power BI, with a 36% increase since 2022. SQL Server also showed a 5.3% increase, and statistics toolbox R increased by 4.8%.

Figure 4. Data analysis and databases

Data engineering was by far the most heavily used topic in this category; it showed a 3.6% decline, stabilizing after a huge gain from 2021 to 2022. Data engineering deals with the problem of storing data at scale and delivering that data to applications. It includes moving data to the cloud, building pipelines for acquiring data and getting data to application software (often in near real time), resolving the issues that are caused by data siloed in different organizations, and more. Two of the most important platforms for data engineering, Kafka and Spark, showed significant declines in 2023 (21% and 20%, respectively). Kafka and Spark have been workhorses for many years, but they are starting to show their age as they become “legacy technology.” (Hadoop, down 26%, is clearly legacy software in 2023.) Interest in Kafka is likely to rise as AI teams start implementing real-time models that have up-to-the-minute knowledge of external data. But we also have to point out that there are newer streaming platforms (like Pulsar) and newer data platforms (like Ray).

Designing enterprise-scale data storage systems is a core part of data engineering. Interest in data warehouses saw an 18% drop from 2022 to 2023. That’s not surprising; data warehouses also qualify as legacy technology. Two other patterns for enterprise-scale storage show significant increases: Usage of content about data lakes is up 37% and, in absolute terms, significantly higher than that of data warehouses. Usage for data mesh content is up 5.6%. Both lakes and meshes solve a basic problem: How do you store data so that it’s easy to access across an organization without building silos that are only relevant to specific groups? Data lakes can include data in many different formats, and it’s up to users to supply structure when data is utilized. A data mesh is a truly distributed solution: each group is responsible for its own data but makes that data available throughout the enterprise through an interoperability layer. Those newer technologies are where we see growth.
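To make "users supply structure when data is utilized" concrete, here is a small sketch of what is often called schema-on-read. It assumes a tiny in-memory "lake" holding raw payloads in two formats; the `RAW_LAKE` contents, the `read_with_schema` helper, and the field names are all invented for illustration.

```python
import csv
import io
import json

# Raw payloads stored as-is, in whatever format they arrived (hypothetical data).
RAW_LAKE = [
    ("json", '{"user": "ana", "amount": "12.50"}'),
    ("csv", "user,amount\nben,7"),
]

def read_with_schema(raw, schema):
    """Parse each payload by format, then apply the reader's schema at read time."""
    for fmt, payload in raw:
        if fmt == "json":
            rows = [json.loads(payload)]
        elif fmt == "csv":
            rows = list(csv.DictReader(io.StringIO(payload)))
        else:
            continue  # unknown formats are skipped in this sketch
        for row in rows:
            # The schema maps field names to casting functions chosen by the reader.
            yield {field: cast(row[field]) for field, cast in schema.items()}

records = list(read_with_schema(RAW_LAKE, {"user": str, "amount": float}))
print(records)
```

A different consumer could read the same raw payloads with a different schema, which is the flexibility (and the burden) that the data lake approach places on its users.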

The two open source data analysis platforms were virtually unchanged in 2023. Usage of content about R increased by 3.6%; we’ve already seen that Python was unchanged, and pandas grew by 1.4%. Neither of these is going anywhere, but alternatives, particularly to pandas, are appearing.

Operations

Whether you call it operations, DevOps, or something else, this field has seen some important changes in the past year. We’ve witnessed the rise of developer platforms, along with the related topic, platform engineering. Both of those are too new to be reflected in our data: you can’t report content use before content exists. But they are influencing other topics.

We’ve said in the past that Linux is table stakes for a job in IT. That’s still true. But the more the deployment process is automated—and platform engineering is just the next step in “Automate All the Things”—the less developers and IT staff need to know about Linux. Software is packaged in containers, and the containers themselves run as virtual Linux instances, but developers don’t need to know how to find and kill out-of-control processes, do a backup, install device drivers, or perform any of the other tasks that are the core of system administration. Usage of content about Linux is down 6.9%: not a major change but possibly a reflection of the fact that the latest steps forward in deploying and managing software shield people from direct contact with the operating system.

Similar trends reduce what developers and IT staff need to know about Kubernetes, the near-ubiquitous container orchestrator (down 6.9%). Anyone who uses Kubernetes knows that it’s complex. We’ve long expected “something simpler” to come along and replace it. It hasn’t—but again, developer platforms put users a step further away from engaging with Kubernetes itself. Knowledge of the details is encapsulated either in a developer platform or, perhaps more often, in a Kubernetes service administered by a cloud provider. Kubernetes can’t be ignored, but it’s more important to understand high-level principles than low-level commands.

Figure 5. Infrastructure and operations

DevOps (9.0%) and SRE (13%) are also down, though we don’t think that’s significant. Terms come and go, and these are going. While operations is constantly evolving, we don’t believe we’ll ever get to the mythical state of “NoOps,” nor should we. Instead, we’ll see constant evolution as the ratio of systems managed to operations staff grows ever higher. But we do believe that sooner rather than later, someone will put a new name on the disciplines of DevOps and its close relative, SRE. That new name might be “platform engineering,” though that term says more about designing deployment pipelines than about carrying the pager and keeping the systems running; platform engineering is about treating developers as customers and designing internal developer platforms that make it easy to test and deploy software systems with minimal ceremony. We don’t believe that platform engineering subsumes or replaces DevOps. Both are partners in improving experience for developers and operations staff (and ratcheting up the ratio of systems managed to staff even higher).

That’s a lot of red ink. What’s in the black? Supply chain management is up 5.9%. That’s not a huge increase, but in the past few years we’ve been forced to think about how we manage the software supply chain. Any significant application easily has dozens of dependencies, and each of those dependencies has its own dependencies. The total number of dependencies, including both direct and inherited dependencies, can easily be hundreds or even thousands. Malicious operators have discovered that they can corrupt software archives, getting programmers to inadvertently incorporate malware into their software. Unfortunately, security problems never really go away; we expect software supply chain security to remain an important issue for the foreseeable (and unforeseeable) future.
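The way direct dependencies fan out into inherited ones can be shown with a short traversal. The dependency graph below is made up for illustration, and `all_dependencies` is a hypothetical helper, not part of any package manager.

```python
def all_dependencies(package, graph):
    """Return every direct and inherited dependency of `package`."""
    seen = set()
    stack = list(graph.get(package, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:  # guards against revisiting shared or cyclic dependencies
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    seen.discard(package)  # a cycle back to the package itself is not a dependency
    return seen

# Hypothetical graph: each package maps to its direct dependencies.
GRAPH = {
    "app": ["web", "db"],
    "web": ["http", "templates"],
    "db": ["driver"],
    "http": ["tls"],
    "templates": [],
    "driver": ["tls"],
    "tls": [],
}

direct = GRAPH["app"]
total = all_dependencies("app", GRAPH)
print(len(direct), "direct dependencies;", len(total), "in total")
```

Even in this toy graph the total is three times the number of direct dependencies, and any one of those packages is a place where malicious code could enter.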

We’ve already mentioned that MLOps, the discipline of deploying and managing models for machine learning and artificial intelligence, is up 14%. Machine learning and AI represent a new kind of software that doesn’t follow traditional rules, so traditional approaches to operations don’t work. The list of differences is long:

While most approaches to deployment are based on the idea that an application can be reproduced from a source archive, that isn’t true for AI. An AI system depends as much on the training data as it does on the source code, and we don’t yet have good tools for archiving training data.
While we’ve said that open source models such as Alpaca are much smaller than models like GPT-4 or Google’s Gemini, even the smallest of those models is very large by any reasonable standard.
While we’ve gotten used to automated testing as part of a deployment pipeline, AI models aren’t deterministic. A test doesn’t necessarily give the same result every time it runs. Testing is no less important for AI than it is for traditional software (arguably it’s more important), and we’re starting to see startups built around AI testing, but we’re still at the beginning.

That’s just a start. MLOps is a badly needed specialty. It’s good to see growing interest.
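One way to cope with nondeterministic tests is to run a check many times and require a minimum pass rate rather than a single pass. The sketch below assumes a stand-in `fake_model` that answers correctly about 90% of the time; the function names, the trial count, and the 0.8 threshold are all illustrative choices, not an established practice from any particular testing tool.

```python
import random

def fake_model(prompt, rng):
    """Hypothetical stand-in for a model call: usually right, sometimes not."""
    return "4" if rng.random() < 0.9 else "5"

def check_arithmetic(rng):
    """A single trial: does the model answer the question correctly this time?"""
    return fake_model("What is 2 + 2?", rng) == "4"

def pass_rate(check, trials, rng):
    """Fraction of trials in which the check passed."""
    passed = sum(1 for _ in range(trials) if check(rng))
    return passed / trials

def meets_threshold(rate, threshold=0.8):
    """The pipeline gate: accept the model only if it passes often enough."""
    return rate >= threshold

rng = random.Random(0)  # seeded so the run itself is repeatable
rate = pass_rate(check_arithmetic, trials=200, rng=rng)
print(f"pass rate: {rate:.2f}, gate passed: {meets_threshold(rate)}")
```

The threshold and the number of trials become part of the test design: too few trials and the gate is noisy, too strict a threshold and a good model fails by chance.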

Security

Almost all branches of security showed growth from 2022 to 2023. That’s a welcome change: in the recent past, many companies talked about security but never made the investment needed to secure their systems. That’s changing, for reasons that are obvious to anyone who reads the news. Nobody wants to be a victim of data theft or ransomware, particularly now that ransomware has evolved into blackmail.

The challenges are really very simple. Network security, keeping intruders off of your network, was the most widely used topic and grew 5%. Firewalls, which are an important component of network security, grew 16%. Hardening, a much smaller topic that addresses making systems less vulnerable to attack, grew 110%. Penetration testing remained one of the most widely used topics. Usage dropped 5%, although a 10% increase for Kali Linux (an important tool for penetration testers) largely offsets that decline.

The 22% growth in security governance is another indicator of changed attitudes: security is no longer an ad hoc exercise that waits for something to happen and then fights fires. Security requires planning, training, testing, and auditing to ensure that policies are effective.

One key to security is knowing who your users are and which parts of the system each user can access. Identity and access management (IAM) has often been identified as a weakness, particularly for cloud security. As systems grow more complex, and as our concept of “identity” evolves from individuals to roles assigned to software services, IAM becomes much more than usernames and passwords. It requires a thorough understanding of who the actors are on your systems and what they’re allowed to do. This extends the old idea of “least privilege”: each actor needs the ability to do exactly what they need, no more and no less. The use of content about IAM grew 8.0% in the past year. It’s a smaller gain than we would have liked to see but not insignificant.
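A deny-by-default permission check is one simple way to picture least privilege when the actors are software services. The policy table, service names, and helpers below are hypothetical and much simpler than any cloud provider's IAM model.

```python
# Hypothetical policy: each actor maps to the (resource, action) pairs it is granted.
ROLE_PERMISSIONS = {
    "billing-service": {("invoices", "read"), ("invoices", "write")},
    "report-job": {("invoices", "read")},
}

def is_allowed(actor, resource, action, policy=ROLE_PERMISSIONS):
    """Deny by default: allow only what the actor's role explicitly grants."""
    return (resource, action) in policy.get(actor, set())

def excess_grants(actor, needed, policy=ROLE_PERMISSIONS):
    """Permissions granted beyond what the actor actually needs."""
    return policy.get(actor, set()) - set(needed)

print(is_allowed("report-job", "invoices", "read"))   # granted explicitly
print(is_allowed("report-job", "invoices", "write"))  # never granted
print(excess_grants("billing-service", [("invoices", "read")]))
```

The second helper reflects the "no more" half of the principle: auditing what an actor holds against what it needs is how over-broad grants get found.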

Figure 6. Security

Application security grew 42%, showing that software developers and operations staff are getting the message. The DevSecOps “shift left” movement, which focuses on software security early in the development process, appears to be winning; use of content about DevSecOps was up 30%. Similarly, those who deploy and maintain applications have become even more aware of their responsibilities. Developers may design identity and access management into the code, but operations is responsible for configuring these correctly and ensuring that access to applications is only granted appropriately. Security can’t be added after the fact; it has to be part of the software process from beginning to the end.

Advanced persistent threats (APTs) were all over the news a few years ago. We don’t see the term APT anywhere near as much as we used to, so we’re not surprised that usage has dropped by 35%. Nevertheless, nation-states with sophisticated offensive capabilities are very real, and cyber warfare is an important component of several international conflicts, including the war in Ukraine.

It’s disappointing to see that usage of content about zero trust has declined by 20%. That decrease is more than offset by the increase in IAM, which is an essential tool for zero trust. But don’t forget that IAM is just a tool and that the goal is to build systems that don’t rely on trust, that always verify that every actor is appropriately identified and authorized. How can you defend your IT infrastructure if you assume that attackers already have access? That’s the question zero trust answers. Trust nothing; verify everything.

Compliance is down 27%. That’s more than offset by the substantial increase of interest in governance. Auditing for compliance is certainly a part of governance, but focusing on compliance itself, without taking into account the larger picture, is a problem rather than a solution. We’ve seen many companies that focus on compliance with existing standards and regulations while avoiding the hard work of analyzing risk and developing effective policies for security. “It isn’t our fault that something bad happened; we followed all the rules” is, at best, a poor way to explain systemic failure. If that compliance-oriented mindset is fading, good riddance. Compliance, understood properly, is an important component of IT governance. Understood badly, compliance is an unacceptable excuse.

Finally, a word about a topic that doesn’t yet appear in our data. There has, of course, been a lot of chatter about the use of AI in security applications. AI will be a great asset for log file analysis, intrusion detection, incident response, digital forensics, and other aspects of cybersecurity. But, as we’ve already said, there are always two sides to AI. How does AI change security itself? Any organization with AI applications will have to protect them from exploitation. What vulnerabilities does AI introduce that didn’t exist a few years ago? There are many articles about prompt injection, sneaky prompts designed to “jailbreak” AI systems, data leakage, and other vulnerabilities—and we believe that’s only the beginning. Securing AI systems will be a critical topic in the coming years.

Cloud Computing

Looking at platform usage for cloud-related topics, one thing stands out: cloud native. Not only is it the most widely used topic in 2023, but it grew 175% from 2022 to 2023. This marks a real transition. In the past, companies built software to run on-premises and then moved it to the cloud as necessary. Despite reports (including ours) that showed 90% or more “cloud adoption,” we always felt that was very optimistic. Sure, 90% of all companies may have one or two experiments in the cloud—but are they really building for the cloud? This huge surge in cloud native development shows that we’ve now crossed that chasm and that companies have stopped kicking the tires. They’re building for the cloud as their primary deployment platform.

You could, of course, draw the opposite conclusion by looking at cloud deployment, which is down 27%. If companies are developing for the cloud, how are those applications being deployed? That’s a fair question. However, as cloud usage grows, so does organizational knowledge of cloud-related topics, particularly deployment. Once an IT group has deployed its first application, the second isn’t necessarily “easy” or “the same,” but it is familiar. At this point in the history of cloud computing, we’re seeing few complete newcomers. Instead we’re seeing existing cloud users deploying more and more applications. We’re also seeing a rise in tools that streamline cloud deployment. Indeed, any provider worth thinking about has a tremendous interest in making deployment as simple as possible.

Figure 7. Cloud architecture

Use of content about cloud security grew 25%, and identity and access management (IAM) grew 8%. An epidemic of data theft and ransomware that continues to this day put security on the corporate map as a priority, not just an expense with annual budget requests that sounded like an extortion scam: “Nothing bad happened this year; give us more money and maybe nothing bad will happen next year.” And while the foundation of any security policy is good local security hygiene, it’s also true that the cloud presents its own issues. Identity and access management: locally, that means passwords, key cards, and (probably) two-factor authentication. In the cloud, that means IAM, along with zero trust. Same idea, but it would be irresponsible to think that these aren’t more difficult in the cloud.

Hybrid cloud is a smaller topic area that has grown significantly in the past year (145%). This growth points partly to the cloud becoming the de facto deployment platform for enterprise applications. It also acknowledges the reality of how cloud computing is adopted. Years ago, when “the cloud” was getting started, it was easy for a few developers in R&D to expense a few hours of time on AWS rather than requisitioning new hardware. The same was true for data-aware marketers who wanted to analyze what was happening with their potential customers—and they might choose Azure. When senior management finally awoke to the need for a “cloud strategy,” they were already in a hybrid situation, with multiple wildcat projects in multiple clouds. Mergers and buyouts complicated the situation more. If company A is primarily using AWS and company B has invested heavily in Google Cloud, what happens when they merge? Unifying behind a single cloud provider isn’t going to be worth it, even though cloud providers are providing tools to simplify migration (at the same time as they make their own clouds difficult to leave). The cloud is naturally hybrid. “Private cloud” and “public cloud,” when positioned as alternatives to each other and to a hybrid cloud, smell like “last year’s news.” It’s not surprising that usage has dropped 46% and 10%, respectively.

Figure 8. Cloud providers

What about the perennial horse race between Amazon Web Services, Microsoft Azure, and Google Cloud? Is anyone still interested, except perhaps investors and analysts? AWS showed a very, very small gain (0.65%), but Azure and Google Cloud showed significant losses (16% and 22%, respectively). We expected to see Azure catch up to AWS because of its lead in AI as a service, but it didn’t. As far as our platform is concerned, that’s still in the future.

Web Development

React and Angular continue to dominate web development. JavaScript is still the lingua franca of web development, and that isn’t likely to change any time soon.

But the usage pattern has changed slightly. Last year, React was up, and Angular was sharply down. This year, usage of React content hasn’t changed substantially (down 0.33%). Angular is down 12%, a smaller decline than last year but still significant. When a platform is as dominant as React, it may have nowhere to go but down. Is momentum shifting?

We see some interesting changes among the less popular frameworks, both old and new. First, Vue isn’t a large part of the overall picture, and it isn’t new (it’s been around since 2014), but if its 28% annual growth continues, it will soon become a dominant framework. That increase represents a solid turnaround after losing 17% from 2021 to 2022. Django is even older (created in 2005), but it’s still widely used, and with an 8% increase this year, it’s not going away. FastAPI is the newest of this group (2018). Because it accounts for a very small percentage of platform use, a small change in usage can have a big effect on its growth rate. Even so, an 80% increase is hard to ignore.

It’s worth looking at these frameworks in a little more detail. Django and FastAPI are both Python-based, and FastAPI takes full advantage of Python’s type hinting feature. Python has long been an also-ran in web development, which has been dominated by JavaScript, React, and Angular. Could that be changing? It’s hard to say, and it’s worth noting that Flask, another Python framework, showed a 12% decrease. As a whole, Python frameworks probably declined from 2022 to 2023, but that may not be the end of the story. Given the number of boot camps training new web programmers in React, the JavaScript hegemony will be hard to overcome.
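To show what "taking advantage of type hints" can mean, here is a standard-library sketch of a dispatcher that reads a handler's annotations and converts incoming string parameters accordingly. It is not FastAPI's implementation or API; `call_with_coercion` and `get_item` are hypothetical names, and real frameworks do far more (validation models, documentation generation, and so on).

```python
import inspect
from typing import get_type_hints

def call_with_coercion(func, raw_params):
    """Convert string parameters to the types declared in func's type hints, then call it."""
    hints = get_type_hints(func)
    kwargs = {}
    for name in inspect.signature(func).parameters:
        if name not in raw_params:
            continue  # let the handler's default value apply
        target = hints.get(name, str)
        try:
            kwargs[name] = target(raw_params[name])
        except ValueError as exc:
            raise ValueError(f"parameter {name!r} must be {target.__name__}") from exc
    return func(**kwargs)

def get_item(item_id: int, q: str = "") -> dict:
    """A hypothetical request handler; the annotations drive the conversion above."""
    return {"item_id": item_id, "q": q}

# Query-string values arrive as text; the hints say item_id should be an int.
print(call_with_coercion(get_item, {"item_id": "42", "q": "radar"}))
```

The handler itself contains no parsing or validation code; the annotations carry that information, which is the appeal of the approach.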

Figure 9. Web development

What about PHP, another long-standing technology that dates back to 1995, when the web was indeed young? PHP grew 5.9% in the past year. The use of content about PHP is small compared to frameworks like React and Angular or even Django. PHP certainly doesn’t inspire the excitement that it did in the 1990s. But remember that over 80% of the web is built on PHP. It’s certainly not trendy, it’s not capable of building the feature-rich sites that many users expect, but it’s everywhere. WordPress (down 4.8%), a content management system used for millions of websites, is based on PHP. But regardless of the number of sites that are built on PHP or WordPress, Indeed shows roughly three times as many job openings for React developers as for PHP and WordPress combined. PHP certainly isn’t going away, and it may even be growing slightly. But we suspect that PHP programmers spend most of their time maintaining older sites. They already know what they need to do that, and neither of those factors drives content usage.

What about some other highly buzzworthy technologies? After showing 74% growth from 2021 to 2022, WebAssembly (Wasm) declined by 41% in 2023. Blazor, a web framework for C# that generates code for Wasm, declined by 11%. Does that mean that Wasm is dying? We still believe Wasm is a very important technology, and we frequently read about amazing projects that are built with it. It isn’t yet a mature technology—and there are plenty of developers willing to argue that there’s no need for it. We may disagree, but that misses the point. Usage of Wasm content will probably decline gradually…until someone creates a killer application with it. Will that happen? Probably, but we can’t guess when.

What does this mean for someone who’s trying to develop their skills as a web developer? First, you still can’t go wrong with React, or even with Angular. The other JavaScript frameworks, such as Next.js, are also good options. Many of these are metaframeworks built on React, so knowing them makes you more versatile while leveraging knowledge you already have. If you’re looking to broaden your skills, Django would be a worthwhile addition. It’s a very capable framework, and knowing Python will open up other possibilities in software development that may be helpful in the future, even if not now.

Certification

This year, we took a different approach to certification. Rather than discussing certification for different subject areas separately (that is, cloud certification, security certification, etc.), we used data from the platform to build a list of the top 20 certifications and grouped them together. That process gives a slightly different picture of which certifications are important and why. We also took a brief look at O’Reilly’s new badges program, which gives another perspective on what our customers want to learn.

Figure 10. Certification

Based on the usage of content in our platform (including practice tests), the most popular certifications are security certifications: CISSP (which declined 4.8%) and CompTIA Security+ (which grew 6.0%). CISSP is an in-depth exam for security professionals, requiring at least five years’ experience before taking the exam. Security+ is more of an entry-level exam, and its growth shows that security staff are still in demand. ISACA’s Certified Information Security Manager (CISM) exam, which focuses on risk assessment, governance, and incident response, isn’t as popular but showed a 54% increase. CompTIA’s Advanced Security Practitioner (CASP+) showed a 10% increase, which is not as large but part of the same trend. The Certified Ethical Hacker (CEH) exam, which focuses on techniques useful for penetration testing or red-teaming, is up 4.1%, after a decline last year. Those increases reflect where management is investing. Hoping that there won’t be an incident has been replaced by understanding exposure, putting in place governance mechanisms to minimize risk, and being able to respond to incidents when they occur.

What really stands out, however, isn’t security: it’s the increased use of content about CompTIA A+, which is up 58%. A+ isn’t a security exam; it’s advertised as an entry-level exam for IT support, stressing topics like operating systems, managing SaaS for remote work, troubleshooting software, hardware, and networking problems, and the like. It’s testimony to the large number of people who want to get into IT. Usage of content about the CompTIA Linux+ exam was much lower but also grew sharply (23%), and, as we’ve said in the past, Linux is “table stakes” for almost any job in computing. It’s more likely that you’ll encounter Linux indirectly via containers or cloud providers rather than managing racks of computers running Linux; but you will be expected to know it. The Certified Kubernetes Administrator (CKA) exam also showed significant growth (32%). Since it was first released in 2014, Kubernetes has become an inescapable part of IT operations. The biggest trend in IT, going back 70 years or so, has been the increase in the ratio of machines to operators: from multiple operators per machine in the ’60s to one operator per machine in the era of minicomputers to dozens of machines per operator and now, in the cloud, to hundreds and thousands. Complex as Kubernetes is (and we admit, we keep looking for a simpler alternative), it’s what lets IT groups manage large applications that are implemented as dozens of microservices and that run in thousands of containers on an uncountable number of virtual machines. Kubernetes has become an essential skill for IT. And certification is becoming increasingly attractive to people working in the field; there’s no other area in which we see so much growth.

Cloud certifications also show prominently. Although “the cloud” has been around for almost 20 years, and almost every company will say that they are “in the cloud,” in reality many companies are still making that transition. Furthermore, cloud providers are constantly adding new services; it’s a field where keeping up with change is difficult. Content about Amazon Web Services was most widely used. AWS Cloud Practitioner increased by 35%, followed by AWS Solutions Architect (Associate), which increased 15%. Microsoft Azure certification content followed, though the two most prominent exams showed a decline: Azure Fundamentals (AZ-900) was down 37%, and Azure Administration (AZ-104) was down 28%. Google Cloud certifications trailed the rest: Google’s Cloud Engineer showed solid growth (14%), while its Data Engineer showed a significant decline (40%).

Content about Microsoft’s AI-900 exam (Azure AI Fundamentals) was the least-used among the certifications that we tracked. However, it gained 121%—it more than doubled—from 2022 to 2023. While we can’t predict next year, this is the sort of change that trends are made of. Why did this exam suddenly get so hot? It’s easy, really: Microsoft’s investment in OpenAI, its integration of the GPT models into Bing and other products, and its AI-as-a-service offerings through Azure have suddenly made the company a leader in cloud-based AI. While we normally hedge our bets on smaller topics with big annual growth—it’s easy for a single new course or book to cause a large swing—AI isn’t going away, nor is Microsoft’s leadership in cloud services for AI developers.

Late in 2023, O’Reilly began to offer badges tied to course completion on the O’Reilly learning platform. Badges aren’t certifications, but looking at the top badges gives another take on what our customers are interested in learning. The results aren’t surprising: Python, GPT (not just ChatGPT), Kubernetes, software architecture, and Java are the most popular badges.

However, it’s interesting to look at the difference between our B2C customers (customers who have bought platform subscriptions as individuals) and B2B customers (who use the platform via a corporate subscription). For most topics, including those listed above, the ratio of B2B to B2C customers is in the range of 2:1 or 3:1 (two or three times as many corporate customers as individuals). The outliers are for topics like communications skills, Agile, Scrum, personal productivity, Excel, and presentation skills: users from B2B accounts obtained these badges four (or more) times as often as users with personal accounts. This makes sense: these topics are about teamwork and other skills that are valuable in a corporate environment.

There are few (if any) badge topics for which individual (B2C) users outnumbered corporate customers; that’s just a reflection of our customer base. However, there were some topics where the ratio of B2B to B2C customers was closer to one. The most interesting of these concerned artificial intelligence: large language models (LLMs), TensorFlow, natural language processing, LangChain, and MLOps. Why is there more interest among individuals than among corporate customers? Perhaps by next year we’ll know.

Design

The important story in design is about tools. Topics like user experience and web design are stable or slightly down (down 0.62% and 3.5%, respectively). But usage of content about design tools is up 105%, and content about the VC unicorn Figma is up 145%. Triple-digit growth probably won’t continue, but it’s certainly worth noticing. It highlights two important trends that go beyond typical design topics, like UX.

First, low-code and no-code tools aren’t new, but many new ones have appeared in the past year. Their success has been aided by artificial intelligence. We already have AI tools that can generate text, whether for a production site or for a mockup. Soon we’ll have no-code tools that don’t just spit out a wireframe but will be able to implement the design itself. They will be smart about what the user wants them to do. But to understand the importance of low-code to design, you have to look beyond the use designers will make of these tools. Designers will also be designing these tools, along with other AI-powered applications. Tools for designers have to be well-designed, of course: that’s trivial. But what many discussions about AI ignore is that designing applications that use AI well is far from trivial. We’ve all been blindsided by the success of ChatGPT, which made the GPT models instantly accessible to everyone. But once you start thinking about the possibilities, you realize that a chat is hardly an ideal interface for an AI system.² What will the users of these systems really need? We’ve only just started down that path. It will be an exciting journey—particularly for designers.

Figure 11. Design

Second, Figma is important because it’s a breakthrough in tools for collaboration. Tools that allow remote employees to collaborate productively are crucial when coworkers can be anywhere: in an office, at home, or on another continent. The last year and a half has been full of talk about virtual reality, metaverses, and the like. But what few have realized is that the metaverse isn’t about wearing goggles—it’s about seamless collaboration with friends and coworkers. Use of content about AR and VR dropped 25% because people have missed the real story: we don’t need 3D goggles; we need tools for collaboration. And, as with low-code, collaboration tools are both something to design with and something that needs to be designed. We’re on the edge of a new way to look at the world.

Use of content about information architecture was up 16%, recovering from its decline from 2021 to 2022. The need to present information well, to design the environments in which we consume information online, has never been more important. Every day, there’s more information to absorb and to navigate—and while artificial intelligence will no doubt help with that navigation, AI is as much a design problem as a design solution. (Though it’s a “good problem” to have.) Designing and building for accessibility is clearly related to information architecture, and it’s good to see more engagement with that content (up 47%). It’s been a long time coming, and while there’s still a long way to go, accessibility is being taken more seriously now than in the past. Websites that are designed to be usable by people with impairments aren’t yet the rule, but they’re no longer exceptions.

Professional Development

Almost everyone involved with software starts as a programmer. But that’s rarely where they end. At some point in their career, they are asked to write a specification, lead a team, manage a group, or maybe even found a company or serve as an executive in an existing company.

O’Reilly is the last company to believe that software developers are neck-bearded geeks who want nothing more than to live in a cave and type on their terminals. We’ve spent most of our history fighting against that stereotype. Nevertheless, going beyond software development is a frequent source of anxiety. That’s no doubt true for anyone stepping outside their comfort zone in just about any field, whether it’s accounting, law, medicine, or something else. But at some point in your career, you have to do something that you aren’t prepared to do. And, honestly, the best leaders are usually the ones who have some anxiety, not the ones whose reaction is “I was born to be a leader.”

Figure 12. Professional development

For the past few years, our audience has been interested in professional growth that goes beyond just writing software or building models for AI and ML. Project management is up 13%; the ability to manage large projects is clearly seen as an asset for employees who are looking for their next promotion (or, in some cases, their next job). Whatever their goals might be, anyone looking for a promotion or a new job—or even just solidifying their hold on their current job—would be well served by improving their communications skills (up 23%). Professional development (up 22%) is a catch-all topic that appears to be responding to the same needs. What’s driving this? 2023 began and ended with a lot of news about layoffs. But despite well-publicized layoffs from huge companies that overhired during the pandemic, there’s little evidence that the industry as a whole has suffered. People who are laid off seem to be snapped up quickly by new employers. Nevertheless, anxiety is real, and the emphasis we’re seeing on professional development (and specifically, communications and project management skills) is partially a result of that anxiety. Another part of the story is no doubt the way AI is changing the workplace. If generative AI makes people more efficient, it frees up time for them to do other things, including strategic thinking about product development and leadership. It may finally be time to value “individuals and interactions over processes and tools,” and “customer collaboration over contract negotiation,” as the Agile Manifesto claims. Doing so will require a certain amount of reeducation, focusing on areas like communications, interpersonal skills, and strategic thinking.

Product management, the discipline of managing a product’s lifecycle from the initial idea through development and release to the market, is also a desirable skill. So why is it only up 2.8% and not 13% like project management? Product management is a newer position in most companies; it has strong ties to marketing and sales, and as far as fear of layoffs is concerned (whether real or media driven), product management positions may be perceived as more vulnerable.

A look at the bottom of the chart shows that usage of content that teaches critical thinking grew 39%. That could be in part a consequence of ChatGPT and the explosion in artificial intelligence. Everyone knows that AI systems make mistakes, and almost every article that discusses these mistakes talks about the need for critical thinking to analyze AI’s output and find errors. Is that the cause? Or is the desire for better critical thinking skills just another aspect of professional growth?

A Strange Year?

Back at the start, I said this was a strange year. As much as we like to talk about the speed at which technology moves, reality usually doesn’t move that fast. When did we first start talking about data? Tim O’Reilly said “Data is the next Intel Inside” in 2005, almost 20 years ago. Kubernetes has been around for a decade, and that’s not counting its prehistory as Google’s Borg. Java was introduced in 1995, almost 30 years ago, and that’s not counting its set-top box prehistory as Oak and Green. C++ first appeared in 1985. Artificial intelligence has a prehistory as long as computing itself. When did AI emerge from its wintry cave to dominate the data science landscape? 2016 or 2017, when we were amazed by programs that could sort images into dogs and cats? Sure, Java has changed a lot; so has what we do with data. Still, there’s more continuity than disruption.

This year was one of the few years that could genuinely be called disruptive. Generative AI will change this industry in important ways. Programmers won’t become obsolete, but programming as we know it might. Programming will have more to do with understanding problems and designing good solutions than specifying, step-by-step, what a computer needs to do. We’re not there yet, but we can certainly imagine a day when a human language description leads reliably to working code, when “Do what I meant, not what I said” ceases to be the programmer’s curse. That change has already begun, with tools like GitHub Copilot. But to thrive in that new industry, programmers will need to know more about architecture, more about design, more about human relations—and we’re only starting to see that in our data, primarily for topics like product management and communications skills. And perhaps that’s the definition of “disruptive”: when our systems and our expectations change faster than our ability to keep up. I’m not worried about programmers “losing their jobs to an AI,” and I really don’t see that concern among the many programmers I talk to. But whatever profession you’re in, you will lose out if you don’t keep up. That isn’t kind or humane; that’s capitalism. And perhaps I should have used ChatGPT to write this report.³

Jerry Lee Lewis might have said “There’s a whole lotta disruption goin’ on.” But despite all this disruption, much of the industry remains unchanged. People seem to have tired of the terms DevOps and SRE, but so it goes: the half-life of a buzzword is inevitably short, and these have been extraordinarily long-lived. The problems these buzzwords represent haven’t gone away. Although we aren’t yet collecting the data (and don’t yet have enough content for which to collect data), developer platforms, self-service deployment, and platform engineering look like the next step in the evolution of IT operations. Will AI play a role in platform engineering? We’d be surprised if it didn’t.

Movement to the cloud continues. While we’ve heard talk of cloud “repatriation,” we see no evidence that it’s happening. We do see evidence that organizations realize that the cloud is naturally hybrid and that focusing on a single cloud provider is short-sighted. There’s also evidence that organizations are now paying more than lip service to security, particularly cloud security. That’s a very good sign, especially after many years in which companies approached security by hoping nothing bad would happen. As many chess grandmasters have said, “Hope is never a good strategy.”

In the coming year, AI’s disruption will continue to play out. What consequences will it have for programming? How will jobs (and job prospects) change? How will IT adapt to the challenge of managing AI applications? Will IT organizations rely on AI-as-a-service providers like OpenAI, Azure, and Google, or will they build on open source models, which will probably run in the cloud? What new vulnerabilities will AI applications introduce into the security landscape? Will we see new architectural patterns and styles? Will AI tools for software architecture and design help developers grapple with the difficulties of microservices, or will they just create confusion?

In 2024, we’ll face all of these questions. Perhaps we’ll start to see answers. One thing is clear: it’s going to be an exciting year.

Footnotes

¹ Google Trends suggests that we may be seeing a resurgence in ChatGPT searches. Meanwhile, searches for ChatGPT on our platform appear to have bottomed out in October, with a very slight increase in November. This discrepancy aligns well with the difference between our platform and Google’s. If you want to use ChatGPT to write a term paper, are you going to search Google or O’Reilly?
² Phillip Carter’s article, “All the Hard Stuff Nobody Talks About when Building Products with LLMs,” is worth reading. While it isn’t specifically about design, almost everything he discusses is something designers should think about.
³ I didn’t. Not even for data analysis.

Structural Evolutions in Data

Q McCallum — Tue, 19 Sep 2023 11:55:04 +0000

I am wired to constantly ask “what’s next?” Sometimes, the answer is: “more of the same.”

That came to mind when a friend raised a point about emerging technology’s fractal nature. Across one story arc, they said, we often see several structural evolutions—smaller-scale versions of that wider phenomenon.

Cloud computing? It progressed from “raw compute and storage” to “reimplementing key services in push-button fashion” to “becoming the backbone of AI work”—all under the umbrella of “renting time and storage on someone else’s computers.” Web3 has similarly progressed through “basic blockchain and cryptocurrency tokens” to “decentralized finance” to “NFTs as loyalty cards.” Each step has been a twist on “what if we could write code to interact with a tamper-resistant ledger in real-time?”

Most recently, I’ve been thinking about this in terms of the space we currently call “AI.” I’ve called out the data field’s rebranding efforts before; but even then, I acknowledged that these weren’t just new coats of paint. Each time, the underlying implementation changed a bit while still staying true to the larger phenomenon of “Analyzing Data for Fun and Profit.”

Consider the structural evolutions of that theme:

Stage 1: Hadoop and Big Data

By 2008, many companies found themselves at the intersection of “a steep increase in online activity” and “a sharp decline in costs for storage and computing.” They weren’t quite sure what this “data” substance was, but they’d convinced themselves that they had tons of it that they could monetize. All they needed was a tool that could handle the massive workload. And Hadoop rolled in.

In short order, it was tough to get a data job if you didn’t have some Hadoop behind your name. And harder to sell a data-related product unless it spoke to Hadoop. The elephant was unstoppable.

Until it wasn’t.

Hadoop’s value—being able to crunch large datasets—often paled in comparison to its costs. A basic, production-ready cluster priced out to the low-six-figures. A company then needed to train up their ops team to manage the cluster, and their analysts to express their ideas in MapReduce. Plus there was all of the infrastructure to push data into the cluster in the first place.

If you weren’t in the terabytes-a-day club, you really had to take a step back and ask what this was all for. Doubly so as hardware improved, eating away at the lower end of Hadoop-worthy work.

And then there was the other problem: for all the fanfare, Hadoop was really large-scale business intelligence (BI).

(Enough time has passed; I think we can now be honest with ourselves. We built an entire industry by … repackaging an existing industry. This is the power of marketing.)

Don’t get me wrong. BI is useful. I’ve sung its praises time and again. But the grouping and summarizing just wasn’t exciting enough for the data addicts. They’d grown tired of learning what is; now they wanted to know what’s next.

Stage 2: Machine learning models

Hadoop could kind of do ML, thanks to third-party tools. But in its early form as a Hadoop-based ML library, Mahout still required data scientists to write in Java. And it (wisely) stuck to implementations of industry-standard algorithms. If you wanted ML beyond what Mahout provided, you had to frame your problem in MapReduce terms. Mental contortions led to code contortions led to frustration. And, often, to giving up.

(After coauthoring Parallel R I gave a number of talks on using Hadoop. A common audience question was “can Hadoop run [my arbitrary analysis job or home-grown algorithm]?” And my answer was a qualified yes: “Hadoop could theoretically scale your job. But only if you or someone else will take the time to implement that approach in MapReduce.” That didn’t go over well.)
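To give a feel for what "framing your problem in MapReduce terms" meant, here is a minimal sketch of a single k-means update split into a map step and a reduce step. It is plain Python standing in for the idea, not Hadoop's or Mahout's actual API; the function names `map_phase` and `reduce_phase` and the sample points are invented for illustration.

```python
from collections import defaultdict

def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
        yield nearest, p

def reduce_phase(pairs, n_dims):
    """Reduce: average the points that share a centroid index."""
    groups = defaultdict(list)
    for key, point in pairs:  # stands in for the framework's shuffle/sort step
        groups[key].append(point)
    return {
        key: tuple(sum(p[d] for p in pts) / len(pts) for d in range(n_dims))
        for key, pts in groups.items()
    }

# Made-up data: two obvious clusters and two rough starting centroids.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

new_centroids = reduce_phase(map_phase(points, centroids), n_dims=2)
print(new_centroids)
```

Even in this toy form, one pass only moves the centroids once. A real job would have to rerun the whole map and reduce cycle until the centroids stopped moving, which hints at why iterative algorithms felt like contortions.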

Goodbye, Hadoop. Hello, R and scikit-learn. A typical data job interview now skipped MapReduce in favor of white-boarding k-means clustering or random forests.

And it was good. For a few years, even. But then we hit another hurdle.

While data scientists were no longer handling Hadoop-sized workloads, they were trying to build predictive models on a different kind of “large” dataset: so-called “unstructured data.” (I prefer to call that “soft numbers,” but that’s another story.) A single document may represent thousands of features. An image? Millions.

Similar to the dawn of Hadoop, we were back to problems that existing tools could not solve.

The solution led us to the next structural evolution. And that brings our story to the present day:

Stage 3: Neural networks

High-end video games required high-end video cards. And since the cards couldn’t tell the difference between “matrix algebra for on-screen display” and “matrix algebra for machine learning,” neural networks became computationally feasible and commercially viable. It felt like, almost overnight, all of machine learning took on some kind of neural backend. Those algorithms packaged with scikit-learn? They were unceremoniously relabeled “classical machine learning.”

There’s as much Keras, TensorFlow, and Torch today as there was Hadoop back in 2010-2012. The data scientist—sorry, “machine learning engineer” or “AI specialist”—job interview now involves one of those toolkits, or one of the higher-level abstractions such as HuggingFace Transformers.

And just as we started to complain that the crypto miners were snapping up all of the affordable GPU cards, cloud providers stepped up to offer access on-demand. Between Google (Vertex AI and Colab) and Amazon (SageMaker), you can now get all of the GPU power your credit card can handle. Google goes a step further in offering compute instances with its specialized TPU hardware.

Not that you’ll even need GPU access all that often. A number of groups, from small research teams to tech behemoths, have used their own GPUs to train on large, interesting datasets and they give those models away for free on sites like TensorFlow Hub and Hugging Face Hub. You can download these models to use out of the box, or employ minimal compute resources to fine-tune them for your particular task.

You see the extreme version of this pretrained model phenomenon in the large generative models behind tools like Midjourney (image generation) and ChatGPT (built on a large language model, or LLM). The overall idea of generative AI is to get a model to create content that could have reasonably fit into its training data. For a sufficiently large training dataset (say, “billions of online images” or “the entirety of Wikipedia”), a model can pick up on the kinds of patterns that make its outputs seem eerily lifelike.

Since we’re covered as far as compute power, tools, and even prebuilt models, what are the frictions of GPU-enabled ML? What will drive us to the next structural iteration of Analyzing Data for Fun and Profit?

Stage 4? Simulation

Given the progression thus far, I think the next structural evolution of Analyzing Data for Fun and Profit will involve a new appreciation for randomness. Specifically, through simulation.

You can see a simulation as a temporary, synthetic environment in which to test an idea. We do this all the time, when we ask “what if?” and play it out in our minds. “What if we leave an hour earlier?” (We’ll miss rush hour traffic.) “What if I bring my duffel bag instead of the roll-aboard?” (It will be easier to fit in the overhead storage.) That works just fine when there are only a few possible outcomes, across a small set of parameters.

Once we’re able to quantify a situation, we can let a computer run “what if?” scenarios at industrial scale. Millions of tests, across as many parameters as will fit on the hardware. It’ll even summarize the results if we ask nicely. That opens the door to a number of possibilities, three of which I’ll highlight here:

Moving beyond point estimates

Let’s say an ML model tells us that this house should sell for $744,568.92. Great! We’ve gotten a machine to make a prediction for us. What more could we possibly want?

Context, for one. The model’s output is just a single number, a point estimate of the most likely price. What we really want is the spread—the range of likely values for that price. Does the model think the correct price falls between $743k-$746k? Or is it more like $600k-$900k? You want the former case if you’re trying to buy or sell that property.

Bayesian data analysis, and other techniques that rely on simulation behind the scenes, offer additional insight here. These approaches vary some parameters, run the process a few million times, and give us a nice curve that shows how often the answer is (or, “is not”) close to that $744k.
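As a rough illustration of the difference between a point estimate and a spread, here is a minimal sketch. It swaps in a bootstrap (resampling the data many times) for a full Bayesian analysis because it fits in a few lines of standard-library Python. The comparable sale prices are made up, and `bootstrap_interval` is a hypothetical helper, not part of any library.

```python
import random
import statistics

def bootstrap_interval(prices, n_runs=10_000, level=0.90, seed=42):
    """Resample `prices` with replacement and return (low, high) bounds
    that contain roughly `level` of the simulated mean prices."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(prices, k=len(prices)))
        for _ in range(n_runs)
    )
    tail = (1.0 - level) / 2.0
    low = means[int(tail * n_runs)]
    high = means[int((1.0 - tail) * n_runs) - 1]
    return low, high

# Hypothetical comparable sales, invented for illustration.
comparable_sales = [702_000, 715_500, 731_000, 744_000, 749_900,
                    758_000, 771_250, 790_000, 815_000, 869_000]

point_estimate = statistics.fmean(comparable_sales)
low, high = bootstrap_interval(comparable_sales)
print(f"point estimate: ${point_estimate:,.0f}")
print(f"90% interval:   ${low:,.0f} to ${high:,.0f}")
```

The single number and the interval come from the same data. The interval simply admits how much the answer moves when the inputs are shuffled, which is the context a buyer or seller actually wants.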

Similarly, Monte Carlo simulations can help us spot trends and outliers in potential outcomes of a process. “Here’s our risk model. Let’s assume these ten parameters can vary, then try the model with several million variations on those parameter sets. What can we learn about the potential outcomes?” Such a simulation could reveal that, under certain specific circumstances, we get a case of total ruin. Isn’t it nice to uncover that in a simulated environment, where we can map out our risk mitigation strategies with calm, level heads?
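A toy version of that risk exercise might look like the sketch below. Everything about the model is an assumption made for illustration: the starting capital, the monthly profit distribution, and the 2% chance of a severe shock are invented numbers, and `simulate_year` and `run_monte_carlo` are hypothetical names.

```python
import random

def simulate_year(rng, capital=100.0):
    """One hypothetical scenario: 12 months of random gains and rare shocks.
    Returns the ending capital, or 0.0 if the business is ruined."""
    for _ in range(12):
        capital += rng.gauss(1.0, 5.0)      # ordinary monthly profit or loss
        if rng.random() < 0.02:             # rare, severe shock
            capital -= rng.uniform(30.0, 80.0)
        if capital <= 0.0:
            return 0.0                      # total ruin: stop this scenario
    return capital

def run_monte_carlo(n_runs=50_000, seed=7):
    """Run many scenarios and summarize the spread of outcomes."""
    rng = random.Random(seed)
    outcomes = sorted(simulate_year(rng) for _ in range(n_runs))
    ruin_rate = sum(1 for x in outcomes if x == 0.0) / n_runs
    return {
        "ruin_rate": ruin_rate,
        "p05": outcomes[int(0.05 * n_runs)],
        "median": outcomes[n_runs // 2],
        "p95": outcomes[int(0.95 * n_runs)],
    }

summary = run_monte_carlo()
for name, value in summary.items():
    print(f"{name:>9}: {value:.3f}")
```

The median scenario looks comfortable, yet a small share of runs ends in ruin. A single point estimate would never show that tail.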

Moving beyond point estimates is very close to present-day AI challenges. That’s why it’s a likely next step in Analyzing Data for Fun and Profit. In turn, that could open the door to other techniques:

New ways of exploring the solution space

If you’re not familiar with evolutionary algorithms, they’re a twist on the traditional Monte Carlo approach. In fact, they’re like several small Monte Carlo simulations run in sequence. After each iteration, the process compares the results to its fitness function, then mixes the attributes of the top performers. Hence the term “evolutionary”—combining the winners is akin to parents passing a mix of their attributes on to progeny. Repeat this enough times and you may just find the best set of parameters for your problem.

(People familiar with optimization algorithms will recognize this as a twist on simulated annealing: start with random parameters and attributes, and narrow that scope over time.)
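The score, keep, and recombine loop described above can be sketched in a few lines. This is a bare-bones illustration under stated assumptions, not a production algorithm: the fitness function just measures distance to a made-up `TARGET` (real problems don't know the answer in advance), and the population size, mutation size, and generation count are arbitrary choices.

```python
import random

TARGET = [3.0, -1.5, 0.5, 8.0]  # hypothetical "best" parameters to rediscover

def fitness(candidate):
    """Higher is better: negative squared distance from TARGET."""
    return -sum((c - t) ** 2 for c, t in zip(candidate, TARGET))

def evolve(pop_size=60, n_keep=12, generations=80, seed=1):
    rng = random.Random(seed)
    # Start from random guesses (the Monte Carlo part).
    population = [[rng.uniform(-10, 10) for _ in TARGET]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Score everyone and keep the top performers.
        parents = sorted(population, key=fitness, reverse=True)[:n_keep]
        children = []
        while len(children) < pop_size - n_keep:
            mom, dad = rng.sample(parents, 2)
            # Crossover: each attribute comes from one parent or the other.
            child = [rng.choice(pair) for pair in zip(mom, dad)]
            # Mutation: a small random nudge keeps the search from stalling.
            child = [gene + rng.gauss(0.0, 0.3) for gene in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print([round(gene, 2) for gene in best], round(fitness(best), 4))
```

Because the parents survive into the next generation, the best score never gets worse. The crossover and mutation steps are what let the search wander somewhere a human might not have looked.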

A number of scholars have tested this shuffle-and-recombine-till-we-find-a-winner approach on timetable scheduling. Their research has applied evolutionary algorithms to groups that need efficient ways to manage finite, time-based resources such as classrooms and factory equipment. Other groups have tested evolutionary algorithms in drug discovery. Both situations benefit from a technique that optimizes the search through a large and daunting solution space.

The NASA ST5 antenna is another example. Its bent, twisted wire stands in stark contrast to the straight aerials with which we are familiar. There’s no chance that a human would ever have come up with it. But the evolutionary approach could, in part because it was not limited by a human sense of aesthetics or any preconceived notions of what an “antenna” could be. It just kept shuffling the designs that satisfied its fitness function until the process finally converged.

Taming complexity

Complex adaptive systems are hardly a new concept, though most people got a harsh introduction at the start of the Covid-19 pandemic. Cities closed down, supply chains snarled, and people—independent actors, behaving in their own best interests—made it worse by hoarding supplies because they thought distribution and manufacturing would never recover. Today, reports of idle cargo ships and overloaded seaside ports remind us that we shifted from under- to over-supply. The mess is far from over.

What makes a complex system troublesome isn’t the sheer number of connections. It’s not even that many of those connections are invisible because a person can’t see the entire system at once. The problem is that those hidden connections only become visible during a malfunction: a failure in Component B affects not only neighboring Components A and C, but also triggers disruptions in T and R. R’s issue is small on its own, but it has just led to an outsized impact in Φ and Σ.

(And if you just asked “wait, how did Greek letters get mixed up in this?” then … you get the point.)

Our current crop of AI tools is powerful, yet ill-equipped to provide insight into complex systems. We can’t surface these hidden connections using a collection of independently-derived point estimates; we need something that can simulate the entangled system of independent actors moving all at once.

This is where agent-based modeling (ABM) comes into play. This technique simulates interactions in a complex system. Similar to the way a Monte Carlo simulation can surface outliers, an ABM can catch unexpected or unfavorable interactions in a safe, synthetic environment.

Financial markets and other economic situations are prime candidates for ABM. These are spaces where a large number of actors behave according to their rational self-interest, and their actions feed into the system and affect others’ behavior. According to practitioners of complexity economics (a study that owes its origins to the Santa Fe Institute), traditional economic modeling treats these systems as though they run in an equilibrium state and therefore fails to identify certain kinds of disruptions. ABM captures a more realistic picture because it simulates a system that feeds back into itself.
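To make the feedback idea concrete, here is a minimal agent-based sketch that borrows the hoarding story from earlier in this section. It is a toy under invented assumptions: one store, 100 shoppers, and a rule that a shopper buys three units instead of one when yesterday's shelves looked emptier than that shopper's personal panic threshold. The name `simulate_store` and all of the numbers are hypothetical.

```python
import random

def simulate_store(days=30, n_shoppers=100, shock_day=None, seed=3):
    """Each day every shopper buys 1 unit, or 3 if yesterday's shelves
    looked emptier than that shopper's personal panic threshold.
    Returns the number of unmet units requested per day."""
    rng = random.Random(seed)
    capacity = 150                                  # shelf space, in units
    thresholds = [rng.uniform(0.05, 0.30) for _ in range(n_shoppers)]
    stock, seen_fill = capacity, 1.0
    unmet_per_day = []
    for day in range(days):
        delivery = 70 if day == shock_day else 100  # one short delivery
        stock = min(capacity, stock + delivery)
        unmet = 0
        shoppers = list(range(n_shoppers))
        rng.shuffle(shoppers)                       # random arrival order
        for i in shoppers:
            wanted = 3 if seen_fill < thresholds[i] else 1
            bought = min(wanted, stock)
            stock -= bought
            unmet += wanted - bought
        seen_fill = stock / capacity                # what shoppers remember
        unmet_per_day.append(unmet)
    return unmet_per_day

calm = simulate_store()
shocked = simulate_store(shock_day=5)
print("no shock:  ", calm[4:9])
print("with shock:", shocked[4:9])
```

In this setup nobody goes without on the day of the short delivery. The shortage appears the next day, created entirely by the agents reacting to what they saw, and in this simple model it never clears. That is the kind of interaction a collection of independent point estimates would not surface.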

Smoothing the on-ramp

Interestingly enough, I haven’t mentioned anything new or ground-breaking. Bayesian data analysis and Monte Carlo simulations are common in finance and insurance. I was first introduced to evolutionary algorithms and agent-based modeling more than fifteen years ago. (If memory serves, this was shortly before I shifted my career to what we now call AI.) And even then I was late to the party.

So why hasn’t this next phase of Analyzing Data for Fun and Profit taken off?

For one, this structural evolution needs a name. Something to distinguish it from “AI.” Something to market. I’ve been using the term “synthetics,” so I’ll offer that up. (Bonus: this umbrella term neatly includes generative AI’s ability to create text, images, and other realistic-yet-heretofore-unseen data points. So we can ride that wave of publicity.)

Next up is compute power. Simulations are CPU-heavy, and sometimes memory-bound. Cloud computing providers make that easier to handle, though, so long as you don’t mind the credit card bill. Eventually we’ll get simulation-specific hardware—what will be the GPU or TPU of simulation?—but I think synthetics can gain traction on existing gear.

The third and largest hurdle is the lack of simulation-specific frameworks. As we surface more use cases—as we apply these techniques to real business problems or even academic challenges—we’ll improve the tools because we’ll want to make that work easier. As the tools improve, that reduces the costs of trying the techniques on other use cases. This kicks off another iteration of the value loop. Use cases tend to magically appear as techniques get easier to use.

If you think I’m overstating the power of tools to spread an idea, imagine trying to solve a problem with a new toolset while also creating that toolset at the same time. It’s tough to balance those competing concerns. If someone else offers to build the tool while you use it and road-test it, you’re probably going to accept. This is why these days we use TensorFlow or Torch instead of hand-writing our backpropagation loops.

Today’s landscape of simulation tooling is uneven. People doing Bayesian data analysis have their choice of two robust, authoritative offerings in Stan and PyMC3, plus a variety of books to understand the mechanics of the process. Things fall off after that. Most of the Monte Carlo simulations I’ve seen are of the hand-rolled variety. And a quick survey of agent-based modeling and evolutionary algorithms turns up a mix of proprietary apps and nascent open-source projects, some of which are geared for a particular problem domain.

As we develop the authoritative toolkits for simulations—the TensorFlow of agent-based modeling and the Hadoop of evolutionary algorithms, if you will—expect adoption to grow. Doubly so, as commercial entities build services around those toolkits and rev up their own marketing (and publishing, and certification) machines.

Time will tell

My expectations of what’s to come are, admittedly, shaped by my experience and clouded by my interests. Time will tell whether any of this hits the mark.

A change in business or consumer appetite could also send the field down a different road. The next hot device, app, or service will get an outsized vote in what companies and consumers expect of technology.

Still, I see value in looking for this field’s structural evolutions. The wider story arc changes with each iteration to address changes in appetite. Practitioners and entrepreneurs, take note.

Job-seekers should do the same. Remember that you once needed Hadoop on your résumé to merit a second look; nowadays it’s a liability. Building models is a desired skill for now, but it’s slowly giving way to robots. So do you really think it’s too late to join the data field? I think not.

Keep an eye out for that next wave. That’ll be your time to jump in.

The next generation of developer productivity

Mike Loukides — Tue, 15 Aug 2023 10:06:33 +0000

To follow up on our previous survey about low-code and no-code tools, we decided to run another short survey about tools specifically for software developers, including but not limited to GitHub Copilot and ChatGPT. We’re interested in how “developer enablement” tools of all sorts are changing the workplace. Our survey¹ showed that while these tools increased productivity, they aren’t without their costs. Both upskilling and retraining developers to use these tools are issues.

Few professional software developers will find it surprising that productivity is the top concern for software development teams: 29% of the respondents said that productivity is the biggest challenge their organization faced, and another 19% said that time to market and deployment speed are the biggest challenges. Those two answers are almost the same: decreasing time to market requires increasing productivity, and improving deployment speed is itself an increase in productivity. Together, those two answers represented 48% of the respondents, just short of half.

HR issues were the second-most-important challenge, but they’re nowhere near as pressing. 12% of the respondents reported that job satisfaction is the greatest challenge; 11% said that there aren’t good job candidates to hire; and 10% said that employee retention is the biggest issue. Those three challenges total 33%, just one-third of the respondents.

¹ Our survey ran from April 18 to April 25, 2023. There were 739 responses.

It’s heartening to realize that hiring and retention are still challenges in this time of massive layoffs, but it’s also important to realize that these issues are less important than productivity.

But the big issue, the issue we wanted to explore, isn’t the challenges themselves; it’s what organizations are doing to meet them. A surprisingly large percentage of respondents (28%) aren’t making any changes to become more productive. But 20% are changing their onboarding and upskilling processes, 15% are hiring new developers, and 13% are using self-service engineering platforms.

We found that the biggest struggle for developers working with new tools is training (34%), and another 12% said the biggest struggle is “ease of use.” Together, that’s almost half of all respondents (46%). That was a surprise, since many of these tools are supposed to be low- or no-code. We’re thinking specifically about tools like GitHub Copilot, Amazon CodeWhisperer, and other code generators, but almost all productivity tools claim to make life simpler. At least at first, that’s clearly not true. There’s a learning curve, and it appears to be steeper than we’d have guessed. It’s also worth noting that 13% of the respondents said that the tools “didn’t effectively solve the problems that developers face.”

Over half of the respondents (51%) said that their organizations are using self-service deployment pipelines to increase productivity. Another 13% said that while they’re using self-service pipelines, they haven’t seen an increase in productivity. So almost two-thirds of the respondents are using self-service pipelines for deployment, and for most of them, the pipelines are working—reducing the overhead required to put new projects into production.

Finally, we wanted to know specifically about the effect of GitHub Copilot, ChatGPT, and other AI-based programming tools. Two-thirds of the respondents (67%) reported that these tools aren’t in use at their organizations. We suspect this estimate is lowballing Copilot’s actual usage. Back in the early 2000s, a widely quoted survey reported that CIOs almost unanimously said that their IT organizations weren’t making use of open source. How little they knew! Actual usage of Copilot, ChatGPT, and similar tools is likely to be much higher than 33%. We’re sure that even if they aren’t using Copilot or ChatGPT on the job, many programmers are experimenting with these tools or using them on personal projects.

What about the 33% who reported that Copilot and ChatGPT are in use at their organizations? First, realize that these are early adopters: Copilot was only released a year and a half ago, and ChatGPT has been out for less than a year. It’s certainly significant that they (and similar tools) have grabbed a third of the market in that short a period. It’s also significant that making a commitment to a new way of programming—and these tools are nothing if not a new kind of programming—is a much bigger change than, say, signing up for a ChatGPT account.

11% of the respondents said their organizations use Copilot and ChatGPT, and that the tools are primarily useful to junior developers; 13% said they’re primarily useful to senior developers. Another 9% said that the tools haven’t yielded an increase in productivity. The gap between junior and senior developers is narrower than we expected. Common wisdom is that Copilot is more of an advantage to senior programmers, who are better able to describe the problem they need to solve in an intricate set of prompts and to notice bugs in the generated code quickly. Our survey hints that the difference between senior and junior developers is relatively small, although they’re almost certainly using Copilot in different ways. Junior developers are using it to learn and to spend less time solving problems by looking up solutions on Stack Overflow or searching online documentation. Senior developers are using it to help design and structure systems, and even to create production code.

Is developer productivity an issue? Of course; it always is. Part of the solution is improved tooling: self-service deployment, code-generation tools, and other new technologies and ideas. Productivity tools, and specifically the successors to tools like Copilot, are remaking software development in radical ways. Software developers are getting value from these tools, but don’t let the buzz fool you: that value doesn’t come for free. Nobody’s going to sit down with ChatGPT, type “Generate an enterprise application for selling shoes,” and come away with something worthwhile. Each tool has its own learning curve, and it’s easy to underestimate how steep that curve can be. Developer productivity tools will be a big part of the future; but to take full advantage of those tools, organizations will need to plan for skills development.

Technology Trends for 2023

Mike Loukides — Wed, 01 Mar 2023 11:44:44 +0000

This year’s report on the O’Reilly learning platform takes a detailed look at how our customers used the platform. Our goal is to find out what they’re interested in now and how that changed from 2021—and to make some predictions about what 2023 will bring.

A lot has happened in the past year. In 2021, we saw that GPT-3 could write stories and even help people write software; in 2022, ChatGPT showed that you can have conversations with an AI. Now developers are using AI to write software. Late in 2021, Mark Zuckerberg started talking about “the metaverse,” and fairly soon, everyone was talking about it. But the conversation cooled almost as quickly as it started. Back then, cryptocurrency prices were approaching a high, and NFTs were “a thing”…then they crashed.

What’s real, and what isn’t? Our data shows us what O’Reilly’s 2.8 million users are actually working on and what they’re learning day-to-day. That’s a better measure of technology trends than anything that happens among the Twitterati. The answers usually aren’t found in big impressive changes; they’re found in smaller shifts that reflect how people are turning the big ideas into real-world products. The signals are often confusing: for example, interest in content about the “big three” cloud providers is slightly down, while interest in content about cloud migration is significantly up. What does that mean? Companies are still “moving into the cloud”—that trend hasn’t changed—but as some move forward, others are pulling back (“repatriation”) or postponing projects. It’s gratifying when we see an important topic come alive: zero trust, which reflects an important rethinking of how security works, showed tremendous growth. But other technology topics (including some favorites) are hitting plateaus or even declining.

While we don’t discuss the economy as such, it’s always in the background. Whether or not we’re actually in a recession, many in our industry perceive us to be so, and that perception can be self-fulfilling. Companies that went on a hiring spree over the past few years are now realizing that they made a mistake—and that includes both giants that do layoffs in the tens of thousands and startups that thought they had access to an endless stream of VC cash. In turn, that reality influences the actions individuals take to safeguard their jobs or increase their value should they need to find a new one.

Methodology

This report is based on our internal “units viewed” metric, which is a single metric across all the media types included in our platform: ebooks, of course, but also videos and live training courses. We use units viewed because it measures what people actually do on our platform. But it’s important to recognize the metric’s shortcomings; as George Box (almost)¹ said, “All metrics are wrong, but some are useful.” Units viewed tends to discount the usage of new topics: if a topic is new, there isn’t much content, and users can’t view content that doesn’t exist. As a counter to our focus on units viewed, we’ll take a brief look at searches, which aren’t constrained by the availability of content. For the purposes of this report, units viewed is always normalized to 1, where 1 is assigned to the greatest number of units in any group of topics.
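As a rough illustration of the normalization described above, the sketch below scales a group of topics so that the most heavily used one is assigned 1 and every other topic becomes a fraction of it. The topic names and raw counts are invented for the example; they are not platform data, and the function name is hypothetical.

```python
def normalize_units(units_by_topic):
    """Scale raw units viewed so the largest topic in the group is 1.0."""
    peak = max(units_by_topic.values())
    if peak <= 0:
        raise ValueError("need at least one topic with positive usage")
    return {topic: units / peak for topic, units in units_by_topic.items()}


# Hypothetical raw counts, for illustration only.
raw = {"topic_a": 5000, "topic_b": 2500, "topic_c": 500}
normalized = normalize_units(raw)

for topic, value in normalized.items():
    print(f"{topic}: {value:.2f}")
```

Because each group is scaled by its own maximum, normalized values are comparable within a group of topics but not across groups.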

It’s also important to remember that these “units” are “viewed” by our users. Whether they access the platform through individual or corporate accounts, O’Reilly members are typically using the platform for work. Despite talk of “internet time,” our industry doesn’t change radically from day to day, month to month, or even year to year. We don’t want to discount or undervalue those who are picking up new ideas and skills—that’s an extremely important use of the platform. But if a company’s IT department was working on its ecommerce site in 2021, they were still working on that site in 2022, they won’t stop working on it in 2023, and they’ll be working on it in 2024. They might be adding AI-driven features or moving it to the cloud and orchestrating it with Kubernetes, but they’re not likely to drop React (or even PHP) to move to the latest cool framework.

However, when the latest cool thing demonstrates a few years of solid growth, it can easily become one of the well-established technologies. That’s happening now with Rust. Rust isn’t going to take over from Java and Python tomorrow, or even in 2024 or 2025, but that’s a movement that’s real. Finally, it’s wise to be skeptical about “noise.” Changes of one or two percentage points often mean little. But when a mature technology that’s leading its category stops growing, it’s fair to wonder whether it’s hit a plateau and is en route to becoming a legacy technology.

The Biggest Picture

We can get a high-level view of platform usage by looking at usage for our top-level topics. Content about software development was the most widely used (31% of all usage in 2022), which includes software architecture and programming languages. Software development is followed by IT operations (18%), which includes cloud, and by data (17%), which includes machine learning and artificial intelligence. Business (13%), security (8%), and web and mobile (6%) come next. That’s a fairly good picture of our core audience’s interests: solidly technical, focused on software rather than hardware, but with a significant stake in business topics.

Total platform usage grew by 14.1% year over year, more than doubling the 6.2% gain we saw from 2020 to 2021. The topics that saw the greatest growth were business (30%), design (23%), data (20%), security (20%), and hardware (19%)—all in the neighborhood of 20% growth. Software development grew by 12%, which sounds disappointing, although in any study like this, the largest categories tend to show the least change. Usage of resources about IT operations only increased by 6.9%. That’s a surprise, particularly since the operations world is still coming to terms with cloud computing.

O’Reilly learning platform usage by topic year over year

While this report focuses on content usage, a quick look at search data gives a feel for the most popular topics, in addition to the fastest growing (and fastest declining) categories. Python, Kubernetes, and Java were the most popular search terms. Searches for Python showed a 29% year-over-year gain, while searches for Java and Kubernetes are almost unchanged: Java gained 3% and Kubernetes declined 4%. But it’s also important to note what searches don’t show: when we look at programming languages, we’ll see that content about Java is more heavily used than content about Python (although Python is growing faster).

Similarly, the actual use of content about Kubernetes showed a slight year-over-year gain (4.4%), despite the decline in the number of searches. And despite being the second-most-popular search term, units viewed for Kubernetes were only 41% of those for Java and 47% of those for Python. This difference between search data and usage data may mean that developers “live” in their programming languages, not in their container tools. They need to know about Kubernetes and frequently need to ask specific questions—and those needs generate a lot of searches. But they’re working with Java or Python constantly, and that generates more units viewed.

The Go programming language is another interesting case. “Go” and “Golang” are distinct search strings, but they’re clearly the same topic. When you add searches for Go and Golang, the Go language moves from 15th and 16th place up to 5th, just behind machine learning. However, change in use of the search term was relatively small: a 1% decline for Go, an 8% increase for Golang. Looking at Go as a topic category, we see something different: usage of content about Go is significantly behind the leaders, Java and Python, but still the third highest on our list, and with a 20% gain from 2021 to 2022.
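The effect of combining two search strings for the same topic can be sketched in a few lines. Everything here is an assumption made for illustration: the alias table, the function names, and the query counts are invented and do not reflect actual search volumes.

```python
from collections import Counter

# Hypothetical alias table mapping variant search strings to one topic.
ALIASES = {"golang": "go"}


def merge_search_terms(counts, aliases=ALIASES):
    """Fold aliased search strings into a single term and sum their counts."""
    merged = Counter()
    for term, n in counts.items():
        key = term.strip().lower()
        merged[aliases.get(key, key)] += n
    return merged


def rank(counts):
    """Return a {term: position} mapping, where 1 is the most-searched term."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {term: i for i, (term, _) in enumerate(ordered, start=1)}


# Invented query counts, for illustration only.
searches = {"python": 900, "kubernetes": 700, "java": 650,
            "aws": 400, "go": 300, "golang": 280}

before = rank(searches)
after = rank(merge_search_terms(searches))
print("before:", before["go"], before["golang"])
print("after: ", after["go"])
```

With these made-up numbers, the two strings rank near the bottom of the list separately, and the merged term moves up once the counts are added together.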

Looking at searches is worthwhile, but it’s important to realize that search data and usage data often tell different stories.

Top searches on the O’Reilly learning platform year over year

Searches can also give a quick picture of which topics are growing. The top three year-over-year gains were for the CompTIA Linux+ certification, the CompTIA A+ certification, and transformers (the AI model that’s led to tremendous progress in natural language processing). However, none of these are what we might call “top tier” search terms: they had ranks ranging from 186 to 405. (That said, keep in mind that the number of unique search terms we see is well over 1,000,000. It’s a lot easier for a search term with a few thousand queries to grow than it is for a search term with 100,000 queries.)

The sharpest declines in search frequency were for cryptocurrency, Bitcoin, Ethereum, and Java 11. There are no real surprises here. This has been a tough year for cryptocurrency, with multiple scandals and crashes. As of late 2021, Java 11 was no longer the current long-term support (LTS) release of Java; that’s moved on to Java 17.

What Our Users Are Doing (in Detail)

That’s a high-level picture. But where are our users actually spending their time? To understand that, we’ll need to take a more detailed look at our topic hierarchy—not just at the topics at the top level but at those in the inner (and innermost) layers.

Software Development

The biggest change we’ve seen is the growth in interest in coding practices; 35% year-over-year growth can’t be ignored, and indicates that software developers are highly motivated to improve their practice of programming. Coding practices is a broad topic that encompasses a lot—software maintenance, test-driven development, maintaining legacy software, and pair programming are all subcategories. Two smaller categories that are closely related to coding practices also showed substantial increases: usage of content about Git (a distributed version control system and source code repository) was up 21%, and QA and testing was up 78%. Practices like the use of code repositories and continuous testing are still spreading to both new developers and older IT departments. These practices are rarely taught in computer science programs, and many companies are just beginning to put them to use. Developers, both new and experienced, are learning them on the job.

Going by units viewed, design patterns is the second-largest category, with a year-over-year increase of 13%. Object-oriented programming showed a healthy 24% increase. The two are closely related, of course; while the concept of design patterns is applicable to any programming paradigm, object-oriented programming (particularly Java, C#, and C++) is where they’ve taken hold.

It’s worth taking a closer look at design patterns. Design patterns are solutions to common problems—they help programmers work without “reinventing wheels.” Above all, design patterns are a way of sharing wisdom. They’ve been abused in the past by programmers who thought software was “good” if it used “design patterns,” and jammed as many into their code as possible, whether or not it was appropriate. Luckily, we’ve gotten beyond that now.

What about functional programming? The “object versus functional” debates of a few years ago are over for the most part. The major ideas behind functional programming can be implemented in any language, and functional programming features have been added to Java, C#, C++, and most other major programming languages. We’re now in an age of “multiparadigm” programming. It feels strange to conclude that object-oriented programming has established itself, because in many ways that was never in doubt; it has long been the paradigm of choice for building large software systems. As our systems are growing ever larger, object-oriented programming’s importance seems secure.

Leadership and management also showed very strong growth (38%). Software developers know that product development isn’t just about code; it relies heavily on communication, collaboration, and critical thinking. They also realize that management or team leadership may well be the next step in their career.

Finally, we’d be remiss not to mention quantum computing. It’s the smallest topic category in this group but showed a 24% year-over-year gain. The first quantum computers are now available through cloud providers like IBM and Amazon Web Services (AWS). While these computers aren’t yet powerful enough to do any real work, they make it possible to get a head start on quantum programming. Nobody knows when quantum computers will be substantial enough to solve real-world problems: maybe two years, maybe 20. But programmers are clearly interested in getting started.

Year-over-year growth for software development topics

Software architecture

Software architecture is a very broad category that encompasses everything from design patterns (which we also saw under software development) to relatively trendy topics like serverless and event-driven architecture. The largest topic in this group was, unsurprisingly, software architecture itself: a category that includes books on the fundamentals of software architecture, systems thinking, communication skills, and much more—almost anything to do with the design, implementation, and management of software. Not only was this a large category, but it also grew significantly: 26% from 2021 to 2022. Software architect has clearly become an important role, the next step for programming staff who want to level up their skills.

For several years, microservices has been one of the most popular topics in software architecture, and this year is no exception. It was the second-largest topic and showed 3.6% growth over 2021. Domain-driven design (DDD) was the third-most-commonly-used topic, although smaller; it also showed growth (19%). Although DDD has been around for a long time, it came into prominence with the rise of microservices as a way to think about partitioning an application into independent services.

Is the relatively low growth of microservices a sign of change? Have microservices reached a peak? We don’t think so, but it’s important to understand the complex relationship between microservices and monolithic architectures. Monoliths inevitably become more complex over time, as bug fixes, new business requirements, the need to scale, and other issues need to be addressed. Decomposing a complex monolith into a complex set of microservices is a challenging task and certainly one that shouldn’t be underestimated: developers are trading one kind of complexity for another in the hope of achieving increased flexibility and scalability long-term. Microservices are no longer a “cool new idea,” and developers have recognized that they’re not the solution to every problem. However, they are a good fit for cloud deployments, and they leave a company well-positioned to offer its services via APIs and become an “as a service” company. Microservices are unlikely to decline, though they may have reached a plateau. They’ve become part of the IT landscape. But companies need to digest the complexity trade-off.

Web APIs, which companies use to provide services to remote client software over the web via HTTP, showed a very healthy increase (76%). This increase shows that we’re moving even more strongly to an “API economy,” where the most successful companies are built not around products but around services accessed through web APIs. That, after all, is the basis for all “software as a service” companies; it’s the basis on which all the cloud providers are built; it’s what ties Amazon’s business empire together. RESTful APIs saw a smaller increase (6%); the momentum has clearly moved from the simplicity of REST to more complex APIs that use JSON, GraphQL, and other technologies to move information.

The 29% increase in the usage of content about distributed systems is important. Several factors drive the increase in distributed systems: the move to microservices, the need to serve astronomical numbers of online clients, the end of Moore’s law, and more. The time when a successful application could run on a single mainframe—or even on a small cluster of servers in a rack—is long gone. Modern applications run across hundreds or thousands of computers, virtual machines, and cloud instances, all connected by high-speed networks and data buses. That includes software running on single laptops equipped with multicore CPUs and GPUs. Distributed systems require designing software that can run effectively in these environments: software that’s reliable, that stays up even when some servers or networks go down, and where there are as few performance bottlenecks as possible. While this category is still relatively small, its growth shows that software developers have realized that all systems are distributed systems; there is no such thing as an application that runs on a single computer.

Year-over-year growth for software architecture and design topics

What about serverless? Serverless looks like an excellent technology for implementing microservices, but it’s been giving us mixed signals for several years now. Some years it’s up slightly; some years it’s down slightly. This year, it’s down 14%, and while that’s not a collapse, we have to see that drop as significant. Like microservices, serverless is no longer a “cool new thing” in software architecture, but the decrease in usage raises questions: Are software developers nervous about the degree of control serverless puts in the hands of cloud providers, spinning up and shutting down instances as needed? That could be a big issue. Cloud customers want to get their accounts payable down, cloud providers want to get their accounts receivable up, and if the provider tweaks a few parameters that the customer never sees, that balance could change a lot. Or has serverless just plunged into the “trough of disillusionment” from which it will eventually emerge into the “plateau of productivity”? Or maybe it’s just an idea whose time came and went? Whatever the reason, serverless has never established itself convincingly. Next year may give us a better idea…or just more ambiguity.

Programming languages

The stories we can tell about programming languages are little changed from last year. Java is the leader (with 1.7% year-over-year growth), followed by Python (3.4% growth). But as we look down the chart, we see some interesting challengers to the status quo. Go’s usage is only 20% of Java’s, but it’s seen 20% growth. That’s substantial. C++ is hardly a new language—and we typically expect older languages to be more stable—but it had 19% year-over-year growth. And Rust, with usage that’s only 9% of Java’s, had 22% growth from 2021 to 2022. Those numbers don’t foreshadow a revolution—as we said at the outset, very few companies are going to take infrastructure written in Java and rewrite it in Go or Rust just so they can be trend compliant. As we all know, a lot of infrastructure is written in COBOL, and that isn’t going anywhere. But both Rust and Go have established themselves in key areas of infrastructure: Docker and Kubernetes are both written in Go, and Rust is establishing itself in the security community (and possibly also the data and AI communities). Go and Rust are already pushing older languages like C++ and Java to evolve. With a few more years of 20% growth, Go and Rust will be challenging Java and Python directly, if they aren’t challenging them already for greenfield projects.
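For a sense of scale, here is a back-of-the-envelope compounding calculation using the figures quoted above (Go at 20% of Java’s usage with 20% growth, Rust at 9% with 22% growth, Java at 1.7% growth). It assumes each rate stays constant indefinitely, which is a strong simplification made only for this sketch, and the function name is hypothetical.

```python
import math


def years_to_match(relative_usage, challenger_growth, leader_growth):
    """Years until the challenger's usage equals the leader's,
    assuming both annual growth rates stay constant (a simplification)."""
    ratio = (1 + challenger_growth) / (1 + leader_growth)
    if ratio <= 1:
        return math.inf  # the challenger never closes the gap
    return math.log(1 / relative_usage) / math.log(ratio)


# Inputs taken from the percentages quoted in the text.
go_years = years_to_match(0.20, 0.20, 0.017)
rust_years = years_to_match(0.09, 0.22, 0.017)

print(f"Go reaches Java's usage in about {go_years:.1f} years")
print(f"Rust reaches Java's usage in about {rust_years:.1f} years")
```

Under these assumptions, parity in total usage is on the order of a decade away or more, which fits the observation that the numbers don’t foreshadow a revolution. Competing for new projects is a different question and doesn’t require parity in overall usage.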

JavaScript is an anomaly on our charts: total usage is 19% of Java’s, with a 4.6% year-over-year decline. JavaScript shows up at, or near, the top on most programming language surveys, such as RedMonk’s rankings (usually in a virtual tie with Java and Python). However, the TIOBE Index shows more space between Python (first place), Java (fourth), and JavaScript (seventh)—more in line with our observations of platform usage. We attribute JavaScript’s decline partly to the increased influence of TypeScript, a statically typed variant of JavaScript that compiles to JavaScript (12% year-over-year increase). One thing we’ve noticed over the past few years: while programmers had a long dalliance with duck typing and dynamic languages, as applications (and teams) grew larger, developers realized the value of strong, statically typed languages (TypeScript certainly, but also Go and Rust, though these are less important for web development). This shift may be cyclical; a decade from now, we may see a revival of interest in dynamic languages. Another factor is the use of frameworks like React, Angular, and Node.js, which are undoubtedly JavaScript but have their own topics in our hierarchy. However, when you add all four together, you still see a 2% decline for JavaScript, without accounting for the shift from JavaScript to TypeScript. Whatever the reason, right now, the pendulum seems to be swinging away from JavaScript. (For more on frameworks, see the discussion of web development.)

The other two languages that saw a drop in usage are C# (6.3%) and Scala (16%). Is this just noise, or is it a more substantial decline? The change seems too large to be a random fluctuation. Scala has always been a language for backend programming, as has C# (though to a lesser extent). While neither language is particularly old, it seems their shine has worn off. They’re both competing poorly with Go and Rust for new users. Scala is also competing poorly with the newer versions of Java, which now have many of the functional features that initially drove interest in Scala.

Year-over-year growth for programming languages

Security

Computer security has been in the news frequently over the past few years. That unwelcome exposure has both revealed cracks in the security posture of many companies and obscured some important changes in the field. The cracks are all too obvious: most organizations do a bad job of the basics. According to one report, 91% of all attacks start with a phishing email that tricks a user into giving up their login credentials. Phishes are becoming more frequent and harder to detect. Basic security hygiene is as important as ever, but it’s getting more difficult. And cloud computing generates its own problems. Companies can no longer protect all of their IT systems behind a firewall; many of the servers are running in a data center somewhere, and IT staff has no idea where they are or even if they exist as physical entities.

Given this shift, it’s not surprising that zero trust, an important new paradigm for designing security into distributed systems, grew 146% between 2021 and 2022. Zero trust abandons the assumption that systems can be protected on some kind of secure network; all attempts to access any system, whether by a person or software, must present proper credentials. Hardening systems, while it received the least usage, grew 91% year over year. Other topics with significant growth were secure coding (40%), advanced persistent threats (55%), and application security (46%). All of these topics are about building applications that can withstand attacks, regardless of where they run.
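The core idea, that every request must present valid credentials no matter which network it comes from, can be sketched as follows. This is a toy illustration rather than a zero trust implementation: the shared secret, the function names, and the identity strings are all hypothetical, and real deployments rely on identity providers, short-lived credentials, and policy engines.

```python
import hashlib
import hmac

# Hypothetical demo secret; a real system would use managed, rotated keys.
SECRET_KEY = b"demo-only-secret"


def sign(identity: str) -> str:
    """Issue a credential (an HMAC signature) for an identity."""
    return hmac.new(SECRET_KEY, identity.encode(), hashlib.sha256).hexdigest()


def authorize(identity: str, signature: str, source_ip: str) -> bool:
    """Grant access only when the credential checks out.

    source_ip is accepted but deliberately ignored: being "inside"
    the corporate network earns no trust.
    """
    return hmac.compare_digest(sign(identity), signature)


# A request from an internal address with a forged credential is refused,
# while a request from anywhere with a valid credential is allowed.
internal_forged = authorize("build-service", "forged", "10.0.0.5")
external_valid = authorize("build-service", sign("build-service"), "203.0.113.7")
print(internal_forged, external_valid)
```

The design choice to note is what the check leaves out: there is no branch that treats an internal address as trusted.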

Governance (year-over-year increase of 72%) is a very broad topic that includes virtually every aspect of compliance and risk management. Issues like security hygiene increasingly fall under “governance,” as companies try to comply with the requirements of insurers and regulators, in addition to making their operations more secure. Because almost all attacks start with a phish or some other kind of social engineering, just telling employees not to give their passwords away won’t help. Companies are increasingly using training programs, password managers, multifactor authentication, and other approaches to maintaining basic hygiene.

Year-over-year growth for security topics

Network security, which was the most heavily used security topic in 2022, grew by a healthy 32%. What drove this increase? Not the use of content about firewalls, which only grew 7%. While firewalls are still useful for protecting the IT infrastructure in a physical office, they’re of limited help when a substantial part of any organization’s infrastructure is in the cloud. What happens when an employee brings their laptop into the office from home or takes it to a coffee shop where it’s more vulnerable to attack? How do you secure WiFi networks for people working from home as well as in the office? The broader problem of network security has only become more difficult, and these problems can’t be solved by corporate firewalls.

Use of content about penetration testing and ethical hacking actually decreased by 14%, although it was the second-most-heavily-used security topic in our taxonomy (and the most heavily used in 2021).

Security certifications

Security professionals love their certifications. Our platform data shows that the most important certifications were CISSP (Certified Information Systems Security Professional) and CompTIA Security+. CISSP has long been the most popular security certification. It’s a very comprehensive certification oriented toward senior security specialists: candidates must have at least five years’ experience in the field to take the exam. Usage of CISSP-related content dropped 0.23% year over year—in other words, it was essentially flat. A change this small is almost certainly noise, but the lack of change may indicate that CISSP has saturated its market.

Compared to CISSP, the CompTIA Security+ certification is aimed at entry- or mid-level security practitioners; it’s a good complement to the other CompTIA certifications, such as Network+. Right now, the demand for security exceeds the supply, and that’s drawing new people into the field. This fits with the increase in the use of content to prepare for the CompTIA Security+ exam, which grew 16% in the past year. The CompTIA CSA+ exam (recently renamed the CySA+) is a more advanced certification aimed specifically at security analysts; it showed 37% growth.

Year-over-year growth for security certifications

Use of content related to the Certified Ethical Hacker certification dropped 5.9%. The reasons for this decline aren’t clear, given that demand for penetration testing (one focus of ethical hacking) is high. However, there are many certifications specifically for penetration testers. It’s also worth noting that penetration testing is frequently a service provided by outside consultants. Most companies don’t have the budget to hire full-time penetration testers, and that may make the CEH certification less attractive to people planning their careers.

CBK (Common Body of Knowledge) isn’t an exam; it’s the framework of material around which the International Information System Security Certification Consortium, more commonly known as (ISC)², builds its exams. With a 31% year-over-year increase for CBK content, it’s another clear sign that interest in security as a profession is growing. And even though (ISC)²’s marquee certification, CISSP, has likely reached saturation, other (ISC)² certifications show clear growth: CCSP (Certified Cloud Security Professional) grew 52%, and SSCP (Systems Security Certified Practitioner) grew 67%. Although these certifications aren’t as popular, their growth is an important trend.

Data

Data is another very broad category, encompassing everything from traditional business analytics to artificial intelligence. Data engineering was the dominant topic by far, growing 35% year over year. Data engineering deals with the problem of storing data at scale and delivering that data to applications. It includes moving data to the cloud, building pipelines for acquiring data and getting data to application software (often in near real time), resolving the issues that are caused by data siloed in different organizations, and more.

Apache Spark, a platform for large-scale data processing, was the most widely used tool, even though the use of content about Spark declined slightly in the past year (2.7%). Hadoop, which would have led this category a decade ago, is still present, though usage of content about Hadoop dropped 8.3%; Hadoop has become a legacy data platform.

Microsoft Power BI has established itself as the leading business analytics platform; content about Power BI was the most heavily used, and achieved 31% year-over-year growth. NoSQL databases was second, with 7.6% growth—but keep in mind that NoSQL was a movement that spawned a large number of databases, with many different properties and designs. Our data shows that NoSQL certainly isn’t dead, despite some claims to the contrary; it has clearly established itself. However, the four top relational databases, if added together into a single “relational database” topic, would be the most heavily used topic by a large margin. Oracle grew 18.2% year over year; Microsoft SQL Server grew 9.4%; MySQL grew 4.7%; and PostgreSQL grew 19%.

Use of content about R, the widely used statistics platform, grew 15% from 2021. Similarly, usage of content about Pandas, the most widely used Python library for working with R-like data frames, grew 20%. It’s interesting that Pandas and R had roughly the same usage. Python and R have been competing (in a friendly way) for the data science market for nearly 20 years. Based on our usage data, right now it looks like a tie. R has slightly more market share, but Pandas has better growth. Both are staples in academic research: R is more of a “statistician’s workbench” with a comprehensive set of statistical tools, while Python and Pandas are built for programmers. The difference has more to do with users’ tastes than substance though: R is a fully capable programming language, and Python has excellent statistical and array-processing libraries.

Usage for content about data lakes and about data warehouses was also just about equal, but data lakes usage had much higher year-over-year growth (50% as opposed to 3.9%). Data lakes are a strategy for storing an organization’s data in an unstructured repository; they came into prominence a few years ago as an alternative to data warehouses. It would be useful to compare data lakes with data lakehouses and data meshes; those terms aren’t in our taxonomy yet.

Year-over-year growth for data analysis and database topics

Artificial intelligence

At the beginning of 2022, who would have thought that we would be asking an AI-driven chat service to explain source code (even if it occasionally makes up facts)? Or that we’d have AI systems that enable nonartists to create works that are on a par with professional designers (even if they can’t match Degas and Renoir)? Yet here we are, and we don’t have ChatGPT or generative AI in our taxonomy. The one thing that we can say is that 2023 will almost certainly take AI even further. How much further nobody knows.

For the past two years, natural language processing (NLP) has been at the forefront of AI research, with the release of OpenAI’s popular tools GPT-3 and ChatGPT along with similar projects from Google, Meta, and others that haven’t been released. NLP has many industrial applications, ranging from automated chat servers to code generation (e.g., GitHub Copilot) to writing tools. It’s not surprising that NLP content was the most viewed and saw significant year-over-year growth (42%). All of this progress is based on deep learning, which was the second-most-heavily-used topic, with 23% growth. Interest in reinforcement learning seems to be off (14% decline), though that may turn around as researchers try to develop AI systems that are more accurate and that can’t be tricked into hate speech. Reinforcement learning from human feedback (RLHF) is one new technique that might lead to better-behaved language models.

There was also relatively little interest in content about chatbots (a 5.8% year-over-year decline). This reversal seems counterintuitive, but it makes sense in retrospect. The release of GPT-3 was a watershed event, an “everything you’ve done so far is out-of-date” moment. We’re excited about what will happen in 2023, though the results will depend a lot on how ChatGPT and its relatives are commercialized, as ChatGPT becomes a fee-based service, and both Microsoft and Google take steps towards chat-based search.

Year-over-year growth for artificial intelligence topics

Our learning platform gives some insight into the tools developers and researchers are using to work with AI. Based on units viewed, scikit-learn was the most popular library. It’s a relatively old tool, but it’s still actively maintained and obviously appreciated by the community: usage increased 4.7% over the year. While usage of content about PyTorch and TensorFlow is roughly equivalent (PyTorch is slightly ahead), it’s clear that PyTorch now has momentum. PyTorch increased 20%, while TensorFlow decreased 4.8%. Keras, a frontend library that uses TensorFlow, dropped 40%.

It’s disappointing to see so little usage of content on MLOps this year, along with a slight drop (4.0%) from 2021 to 2022. One of the biggest problems facing machine learning and artificial intelligence is deploying applications into production and then maintaining them. ML and AI applications need to be integrated into the deployment processes used for other IT applications. This is the business of MLOps, which presents a set of problems that are only beginning to be solved, including versioning for large sets of training data and automated testing to determine when a model has become stale and needs retraining. Perhaps it’s still too early, but these problems must be addressed if ML and AI are to succeed in the enterprise.
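As a rough illustration of the "automated testing to determine when a model has become stale" idea, the sketch below flags a model when its recent accuracy falls well below the accuracy measured at deployment. The function name, the 0.05 tolerance, and the sample numbers are assumptions made for the example, not part of any particular MLOps tool.

```python
def is_stale(baseline_accuracy, recent_outcomes, tolerance=0.05):
    """Return True when recent accuracy has dropped below the baseline.

    baseline_accuracy: accuracy measured when the model was deployed.
    recent_outcomes: list of booleans, True where a recent prediction was correct.
    tolerance: hypothetical allowed drop before retraining is suggested.
    """
    if not recent_outcomes:
        return False  # no evidence yet, so don't raise an alarm
    recent_accuracy = sum(recent_outcomes) / len(recent_outcomes)
    return recent_accuracy < baseline_accuracy - tolerance


# Hypothetical monitoring windows of 100 labeled predictions each.
healthy_window = [True] * 88 + [False] * 12    # 88% correct
degraded_window = [True] * 80 + [False] * 20   # 80% correct

print(is_stale(0.90, healthy_window))
print(is_stale(0.90, degraded_window))
```

A production check would usually also look at shifts in the input data, since labels for recent predictions often arrive late.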

No-code and low-code tools for AI don’t appear in our taxonomy, unfortunately. Our report AI Adoption in the Enterprise 2022 argues that AutoML in its various incarnations is gradually gaining traction. This is a trend worth watching. While there’s very little training available on Google AutoML, Amazon AutoML, IBM AutoAI, Amazon SageMaker, and other low-code tools, they’ll almost certainly be an important force multiplier for experienced AI developers.

Infrastructure and Operations

Containers, Linux, and Kubernetes are the top topics within infrastructure and operations. Containers sits at the top of the list (with 2.5% year-over-year growth), with Docker, the most popular container, in fifth place (with a 4.4% decline). Linux, the second most used topic, grew 4.4% year over year. There’s no surprise here; as we’ve been saying for some time, Linux is “table stakes” for operations. Kubernetes is third, with 4.4% growth.

The containers topic is extremely broad: it includes a lot of content that’s primarily about Docker but also content about containers in general, alternatives to Docker (most notably Podman), container deployment, and many other subtopics. It’s clear that containers have changed the way we deploy software, particularly in the cloud. It’s also clear that containers are here to stay. Docker’s small drop is worth noting but isn’t a harbinger of change. Kubernetes deprecated direct Docker support at the end of 2020 in favor of the Container Runtime Interface (CRI). That change eliminated a direct tie between Kubernetes and Docker but doesn’t mean that containers built by Docker won’t run on Kubernetes, since Docker builds images in the standard Open Container Initiative (OCI) format, which CRI-compatible runtimes such as containerd can run. A more convincing reason for the drop in usage is that Docker is no longer new and developers and other IT staff are comfortable with it. Docker itself may be a smaller piece of the operations ecosystem, and it may have plateaued, but it’s still very much there.

Content about Kubernetes was the third most widely viewed in this group, and usage grew 4.4% year over year. That relatively slow growth may mean that Kubernetes is close to a plateau. We increasingly see complaints that Kubernetes is overly complex, and we expect that, sooner or later, someone will build a container orchestration platform that’s simpler, or that developers will move toward “managed” solutions where a third party (probably a cloud provider) manages Kubernetes for them. One important part of the Kubernetes ecosystem, the service mesh, is declining; content about service mesh showed a 28% decline, while content about Istio (the service mesh implementation most closely tied to Kubernetes) declined 42%. Again, service meshes (and specifically Istio) are widely decried as too complex. It’s indicative (and perhaps alarming) that IT departments are resorting to “roll your own” for a complex piece of infrastructure that manages communications between services and microservices (including services for security). Alternatives are emerging. HashiCorp’s Consul and the open source Linkerd project are promising service meshes. UC Berkeley’s RISELab, which developed both Ray and Spark, recently announced SkyPilot, a tool with goals similar to Kubernetes but that’s specialized for data. Whatever the outcome, we don’t believe that Kubernetes is the last word in container orchestration.

Year-over-year growth for infrastructure and operations topics

If there’s any tool that defines “infrastructure as code,” it’s Terraform, which saw 74% year-over-year growth. Terraform’s goals are relatively simple: You write a simple description of the infrastructure you want and how you want that infrastructure configured. Terraform gathers the resources and configures them for you. Terraform can be used with all of the major cloud providers, in addition to private clouds (via OpenStack), and it’s proven to be an essential tool for organizations that are migrating to the cloud.
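The declarative idea described above can be sketched in a few lines: you state the infrastructure you want, and a tool works out what has to change. This is a toy illustration only; the resource names and fields are hypothetical, and it is neither Terraform's configuration syntax nor its implementation.

```python
def plan(desired, current):
    """Compare desired and current state; return the actions needed."""
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical description of the infrastructure we want.
desired = {
    "web-server": {"size": "small", "count": 2},
    "database": {"size": "large", "count": 1},
}
# Hypothetical snapshot of what currently exists.
current = {
    "web-server": {"size": "small", "count": 1},
    "old-cache": {"size": "small", "count": 1},
}

steps = plan(desired, current)
for action, name in steps:
    print(action, name)
```

The point of the design is that the description says nothing about how to get from the current state to the desired one; the tool derives those steps.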

We took a separate look at the “continuous” methodologies (also known as CI/CD): continuous integration, continuous delivery, and continuous deployment. Overall, this group showed an 18% year-over-year increase in units viewed. This growth comes largely from a huge (40%) increase in the use of content about continuous delivery. Continuous integration showed a 22% decline, while continuous deployment had a 7.1% increase.

What does this tell us? The term continuous integration was first used by Grady Booch in 1991 and popularized by the Extreme Programming movement in the late 1990s. It refers to the practice of merging code changes into a single repository frequently, testing at each iteration to ensure that the project is always in a coherent state. Continuous integration is tightly coupled to continuous delivery; you almost always see CI/CD together. Continuous delivery is a practice that was developed at the second-generation web companies, including Flickr, Facebook, and Amazon, which radically changed IT practice by staging software updates for deployment several times daily. With continuous delivery, deployment pipelines are fully automated, requiring only a final approval to put a release into production. Continuous deployment is the newest (and smallest) of the three, emphasizing completely automated deployment to production: updates go directly from the developer into production, without any intervention. These methodologies are closely tied to each other. CI/CD/CD as a whole (and yes, nobody ever uses CD twice) is up 18% for the year. That’s a significant gain, and even though these topics have been around for a while, it’s evidence that growth is still possible.
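It may seem odd that the group grew 18% while continuous integration fell 22%. Group growth is the average of the topic growth rates weighted by each topic's share of the previous year's usage, so the result depends on how large each topic was. The shares below are hypothetical (the report doesn't publish them); they were picked only to show one combination that is consistent with the reported rates.

```python
# Hypothetical prior-year shares of units viewed (NOT O'Reilly's data).
prior_share = {
    "continuous integration": 0.30,
    "continuous delivery": 0.60,
    "continuous deployment": 0.10,
}

# Year-over-year growth rates as reported in the text.
growth = {
    "continuous integration": -0.22,
    "continuous delivery": 0.40,
    "continuous deployment": 0.071,
}


def group_growth(shares, rates):
    """Share-weighted average of the per-topic growth rates."""
    return sum(shares[topic] * rates[topic] for topic in shares)


overall = group_growth(prior_share, growth)
print(f"{overall:.1%}")  # about 18% under these assumed shares
```

Under these assumed shares, continuous delivery would have to account for roughly 60% of the group's prior-year usage for the numbers to line up.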

Year-over-year growth for continuous methodologies

IT and operations certifications

The leading IT certification is clearly CompTIA, which showed a 41% year-over-year increase. The CompTIA family (Network+, A+, Linux+, and Security+) dominates the certification market. (The CompTIA Network+ showed a very slight decline (0.32%), which is probably just random fluctuation.) The Linux+ certification experienced tremendous year-over-year growth (47%). That growth is easy to understand. Linux has long been the dominant server operating system. In the cloud, Linux instances are much more widely used than the alternatives, though Windows is offered on Azure (of course) along with macOS. In the past few years, Linux’s market penetration has gone even deeper. We’ve already seen the role that containers are playing, and containers almost always run Linux as their operating system. In 1995, Linux might have been a quirky choice for people devoted to free and open source software. In 2023, Linux is mandatory for anyone in IT or software development. It’s hard to imagine getting a job or advancing in a career without demonstrating competence.

Year-over-year growth for IT certifications

It’s surprising to see the Cisco Certified Network Associate (CCNA) certification drop 18% and the Cisco Certified Network Professional (CCNP) certification drop 12%, as the Cisco certifications have been among the most meaningful and prestigious in IT for many years. (The Cisco Certified Internetwork Expert (CCIE) certification, while relatively small compared to the others, did show 70% growth.) There are several causes for this shift. First, as companies move workloads to the cloud or to colocation providers, maintaining a fleet of routers and switches becomes less important. Network certifications are less valuable than they used to be. But why then the increase in CCIE? While CCNA is an entry-level certification and CCNP is middle tier, CCIE is Cisco’s top-tier certification. The exam is very detailed and rigorous and includes hands-on work with network hardware. Hence the relatively small number of people who attempt it and study for it. However, even as companies offload much of their day-to-day network administration to the cloud, they still need people who understand networks in depth. They still have to deal with office networks, and with extending office networks to remote employees. While they don’t need staff to wrangle racks of data center routers, they do need network experts who understand what their cloud and colocation providers are doing. The need for network staff might be shrinking, but it isn’t going away. In a shrinking market, attaining the highest level of certification will have the most long-term value.

Cloud

We haven’t seen any significant shifts among the major cloud providers. Amazon Web Services (AWS) still leads, followed by Microsoft Azure, then Google Cloud. Together, this group represents 97% of cloud platform content usage. The bigger story is that we saw decreases in year-over-year usage for all three. The decreases are small and might not be significant: AWS is down 3.8%, Azure 7.5%, and Google Cloud 2.1%. We don’t know what’s responsible for this decline. We looked industry by industry; some were up, some were down, but there were no smoking guns. AWS showed a sharp drop in computers and electronics (about 27%), which is a relatively large category, and a smaller drop in finance and banking (15%), balanced by substantial growth in higher education (35%). There was a lot of volatility among industries that aren’t big cloud users—for example, AWS was up about 250% in agriculture—but usage among industries that aren’t major cloud users isn’t high enough to account for that change. (Agriculture accounts for well under 1% of total AWS content usage.) The bottom line is, as they say in the nightly financial news, “Declines outnumbered gains”: 16 out of 28 business categories showed a decline. Azure was similar, with 20 industries showing declines, although Azure saw a slight increase for finance and banking. The same was true for Google Cloud, though it benefited from an influx of individual (B2C) users (up 9%).

Over the past year, there’s been some discussion of “cloud repatriation”: bringing applications that have moved to the cloud back in-house. Cost is the greatest motivation for repatriation; companies moving to the cloud have often underestimated the costs, partly because they haven’t succeeded in using the cloud effectively. While repatriation is no doubt responsible for some of the decline, it’s at most a small part of the story. Cloud providers make it difficult to leave, which ironically might drive more content usage as IT staff try to figure out how to get their data back. A bigger issue might be companies that are putting cloud plans on hold because they hear of repatriation or that are postponing large IT projects because they fear a recession.

Of the smaller cloud providers, IBM showed a huge year-over-year increase (135%). Almost all of the change came from a significant increase in consulting and professional services (200% growth year over year). Oracle showed a 36% decrease, almost entirely due to a drop in content usage from the software industry (down 49%). However, the fact that Oracle is showing up at all demonstrates that it’s grown significantly over the past few years. Oracle’s high-profile deal to host all of TikTok’s data on US residents could easily solidify the company’s position as a significant cloud provider. (Or it could backfire if TikTok is banned.)

We didn’t include two smaller providers in the graph: Heroku (now owned by Salesforce) and Cloud Foundry (originally VMware, handed off to the company’s Pivotal subsidiary and then to the Cloud Foundry Foundation; now, multiple providers run Cloud Foundry software). Both saw fairly sharp year-over-year declines: 10% for Heroku, 26% for Cloud Foundry. As far as units viewed, Cloud Foundry is almost on a par with IBM. But Heroku isn’t even on the charts; it appears to be a service whose time has passed. We also omitted Tencent and Alibaba Cloud; they’re not in our subject taxonomy, and relatively little content is available.

Year-over-year growth for cloud providers

Cloud certifications followed a similar pattern. AWS certifications led, followed by Azure, followed by Google Cloud. We saw the same puzzling year-over-year decline here: 13% for AWS certification, 10% for Azure, and 6% for Google Cloud. And again, the drop was smallest for Google Cloud.

While usage of content about specific cloud providers dropped from 2021 to 2022, usage for content about other cloud computing topics grew. Cloud migration, a fairly general category for content about building cloud applications, grew 45%. Cloud service models also grew 41%. These increases may help us to understand why usage of content about the “big three” clouds decreased. As cloud usage moves beyond early adopters and becomes mainstream, the conversation naturally focuses less on individual cloud providers and more on high-level issues. After a few pilot projects and proofs of concept, learning about AWS, Azure, and Google Cloud is less important than planning a full-scale migration. How do you deploy to the cloud? How do you build services in the cloud? How do you integrate applications you have moved to the cloud with legacy applications that are staying in-house? At this point, companies know the basics and have to go the rest of the way.

Year-over-year growth for cloud certifications

With this in mind, it’s not at all surprising that our customers are very interested in hybrid clouds, for which content usage grew 28% year over year. Our users realize that every company will inevitably evolve toward a hybrid cloud. Either there’ll be a wildcat skunkworks project on some cloud that hasn’t been “blessed” by IT, or there’ll be an acquisition of a company that’s using a different provider, or they’ll need to integrate with a business partner using a different provider, or they don’t have the budget to move their legacy applications and data, or… The reasons are endless, but the conclusion is the same: hybrid is inevitable, and in many companies it’s already the reality.

The increase in use of content about private clouds (37%) is part of the same story. Many companies have applications and data that have to remain in-house (whether that’s physically on-premises or hosted at a data center offering colocation). It still makes sense for those applications to use APIs and deployment toolchains equivalent to those used in the cloud. “The cloud” isn’t the exception; it has become the rule.

Year-over-year growth for cloud architecture topics

Professional Skills

In the past year, O’Reilly users have been very interested in upgrading their professional and management skills. Every category in this relatively small group is up, and most of them are up significantly. Project management saw 47% year-over-year growth; professional development grew 37%. Use of content about the Project Management Professional (PMP) certification grew 36%, and interest in product management grew similarly (39%). Interest in communication skills increased 26% and interest in leadership grew by 28%. The two remaining categories that we tracked, IT management and critical thinking, weren’t as large and grew by somewhat smaller amounts (21% and 20%, respectively).

Several factors drive these increases. For a long time, software development and IT operations were seen as solo pursuits dominated by “neckbeards” and antisocial nerds, with some “rock stars” and “10x programmers” thrown in. This stereotype is wrong and harmful—not just to individuals but to teams and companies. In the past few years, we’ve heard a lot less about 10x developers and more about the importance of good communication, leadership, and mentoring. Our customers have realized that the key to productivity is good teamwork, not some mythical 10x developer. And there are certainly many employees who see positions in management, as a “tech lead,” as a product manager, or as a software architect, as the obvious next step in their careers. All of these positions stress the so-called “soft skills.” Finally, talk about a recession has been on the rise for the past year, and we continue to see large layoffs from big companies. While software developers and IT operations staff are still in high demand, and there’s no shortage of jobs, many are certainly trying to acquire new skills to improve their job security or to give themselves better options in the event that they’re laid off.

Year-over-year growth for professional skills topics

Web Development

The React and Angular frameworks continue to dominate web development. The balance is continuing to shift toward React (10% year-over-year growth) and away from Angular (a 17% decline). Many frontend developers feel that React offers better performance and is more flexible and easier to learn. Many new frameworks (and frameworks built on frameworks) are in play (Vue, Next.js, Svelte, and so on), but none are close to becoming competitors. Vue showed a significant year-over-year decline (17%), and the others didn’t make it onto the chart.

PHP is still a contender, of course, with almost no change (a decline of 1%). PHP advocates claim that 80% of the web is built on it: Facebook is built on PHP, for instance, along with millions of WordPress sites. Still, it’s hard to look at PHP and say that it’s not a legacy technology. Ruby on Rails grew 6.6%. Content usage for Ruby on Rails is similar to PHP, but Rails usage has been declining for some years. Is it poised for a comeback?

The use of content about JavaScript showed a slight decline (4.6%), but we don’t believe this is significant. In our taxonomy, content can only be tagged with one topic, and everything that covers React or Angular is implicitly about JavaScript. In addition, it’s interesting to see usage of TypeScript increasing (12%); TypeScript is a statically typed superset of JavaScript that compiles (the right word is actually “transpiles”) to JavaScript, and it’s proving to be a better tool for large complex applications.

One important trend shows up at the bottom of the graph. WebAssembly is still a small topic, but it saw 74% growth from 2021 to 2022. And Blazor, Microsoft’s implementation of C# and .NET for WebAssembly, is up 59%. That’s a powerful signal. These topics are still small, but if they can maintain that kind of growth, they won’t be small for long. WebAssembly is poised to become an important part of web development.

Year-over-year growth for web development topics

Design

The heaviest usage in the design category went to user experience and related topics. User experience grew 18%, user research grew 5%, interface design grew 92%, and interaction design grew 36%. For years, we expected software to be difficult and uncomfortable to use. That’s changed. Apple made user interface design a priority in the early 2000s, forcing other companies to follow if they wanted to remain competitive. The design thinking movement may no longer be in the news, but it’s had an effect: software teams think about design from the beginning. Even software developers who don’t have the word “design” in their job title need to think about and understand design well enough to build decent user interfaces and pleasant user experiences.

Usability, the only user-centric topic to show a decline, was only down 2.6%. It’s also worth noting that use of content about accessibility has grown 96%. Accessibility is still a relatively small category, but that kind of growth shows that accessibility is an aspect of user experience that can no longer be ignored. (The use of alt text for images is only one example: it’s become common on Twitter and is almost universal on Mastodon.)

Information architecture was down significantly (a 17% drop). Does that mean that interest has shifted from designing information flow to designing experiences, and is that a good thing?

Use of content about virtual and augmented reality is relatively small but grew 83%. The past year saw a lot of excitement around VR, Web3, the metaverse, and related topics. Toward the end of the year, that seemed to cool off. However, an 83% increase is noteworthy. Will that continue? It may depend on a new generation of VR products, both hardware and software. If Apple can make VR glasses that are comfortable and that people can wear without looking like aliens, 83% growth might seem small.

Year-over-year growth for design topics

The Future

We started out by saying that this industry doesn’t change as much from year to year as most people think. That’s true, but that doesn’t mean there’s no change. There are signals of important new trends—some completely new, some continuations of trends that started years ago. So what small changes are harbingers of bigger changes in the years to come?

The Go and Rust programming languages have shown significant growth both in the past year and for the last few years. There’s no sign that this growth will stop. It will take a few more years, but before long they’ll be on a par with Java and Python.

It’s no surprise that we saw huge gains for natural language processing and deep learning. GPT-3 and its successor ChatGPT are the current stars of the show. While there’s been a lot of talk about another “AI winter,” that isn’t going to happen. The success of ChatGPT (not to mention Stable Diffusion, Midjourney, and many projects going on at Meta and Google) will keep winter away, at least for another year. What will people build on top of ChatGPT and its successors? What new programming tools will we see? How will the meaning of “computer programming” change if AI assistants take over the task of writing code? What new research tools will become available, and will our new AI assistants persist in “making stuff up”? For several years now, AI has been the most exciting area in software. There’s lots to imagine, lots to build, and infinite space for innovation. As long as the AI community provides exciting new results, no one will be complaining and no one need fear the cold.

We’ve also seen a strong increase in interest in leadership, management, communication, and other “soft skills.” This interest isn’t new, but it’s certainly growing. Whether the current generation of programmers is getting tired of coding or whether they perceive soft skills as giving them better job security during a recession isn’t for us to say. It’s certainly true that better communication skills are an asset for any project.

Our audience is slightly less interested in content about the “big three” cloud providers (AWS, Azure, and Google Cloud), but they’re still tremendously interested in migrating to the cloud and taking advantage of cloud offerings. Despite many reports claiming that cloud adoption is almost universal (and I confess to writing some of them), I’ve long believed that we’re only in the early stages of cloud adoption. We’re now past the initial stage, during which a company might claim that it was “in the cloud” on the basis of a few trial projects. Cloud migration is serious business. We expect to see a new wave of cloud adoption. Companies in that wave won’t make naive assumptions about the costs of using the cloud, and they’ll have the tools to optimize their cloud usage. This new wave may not break until fears of a recession end, but it will come.

While the top-level security category grew 20%, we’d hoped to see more. For a long time, security was an afterthought, not a priority. That’s changing, but slowly. However, we saw huge gains for zero trust and governance. It’s unfortunate that these gains are driven by necessity (and the news cycle), but perhaps the message is getting through after all.

What about augmented and virtual reality (AR/VR), the metaverse, and other trendy topics that dominated much of the trade press? Interest in VR/AR content grew significantly, though what that means for 2023 is anyone’s guess. Long-term, the category probably depends on whether or not anyone can make AR glasses a fashion accessory that everyone needs to have. A bigger question is whether anyone can build a next-generation web that’s decentralized, and that fosters immediacy and collaboration without requiring exotic goggles. That’s clearly something that can be done: look no further than Figma (for collaboration), Mastodon (for decentralization), or Petals (for a cloud-less cloud).

Will these be the big stories for 2023? February is only just beginning; we have 11 months to find out.

Footnotes

1. Box said “models”; a metric is a kind of model, isn’t it?

Healthy Data

Mike Loukides — Tue, 15 Nov 2022 15:18:53 +0000

This summer, we started asking about “technical health.” We don’t see a lot of people asking what it means to use technology in healthy ways, at least not in so many words. That’s understandable because “technical health” is so broad that it’s difficult to think about. It’s easy to ask a question like “Are you using agile methodologies?” and assume that means “technical health.” Agile is good, right? But agile is not the whole picture. Neither is being “data driven.” Or Lean. Or using the latest, coolest programming languages and frameworks. Nor are any of these trends, present or past, irrelevant.

To investigate what’s meant by “technical health,” we have begun a series of short surveys to help us understand what technical health means, and to help our readers think about the technical health of their organizations. The first survey looked at the use of data. It ran from August 30, 2022 to September 30, 2022. We received 693 responses, of which 337 were complete (i.e., the respondent answered all the questions). We didn’t include the incomplete responses in our results, a practice that’s consistent with our other, lengthier surveys.

No single question and answer stood out; we can’t say “everybody does X” or “nobody does Y.” Whether or not that’s healthy in and of itself, it suggests that there isn’t yet any consensus about the role data plays. For example, the first question was “What percentage of enterprise-wide decisions are driven primarily by data?” 19% of the respondents answered “25% or less”; 31% said “76% or more.” We were surprised to see that the percentage of respondents who said that most decisions aren’t data driven was so similar to the percentage who thought they are. The difference between 19% and 31% looks much larger on paper than it is in practice. Yes, it’s almost a 2:1 ratio, but it shows that a lot of respondents work for companies that aren’t using data in their decision making. Even more significant, fully half of the respondents put their companies in the “sort of data driven” middle ground (26-50% and 51-75% received 25% and 26% of responses, respectively). Does this mean that most companies are somewhere along the path towards being data-driven, with the “25% or less” cohort representing companies that are “catching up”? It’s hard to say.
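The arithmetic behind this paragraph is easy to check. The sketch below uses only the four percentages reported above; the variable names are ours.

```python
# Share of respondents (in percent) choosing each answer, as reported.
responses = {
    "25% or less": 19,
    "26-50%": 25,
    "51-75%": 26,
    "76% or more": 31,
}

# The "sort of data driven" middle ground.
middle = responses["26-50%"] + responses["51-75%"]

# Rounded percentages need not sum to exactly 100.
total = sum(responses.values())

# Ratio of the most data-driven group to the least data-driven group.
ratio = responses["76% or more"] / responses["25% or less"]

print(middle, total, round(ratio, 2))
```

The middle two buckets hold 51% of respondents, the four rounded figures sum to 101%, and the ratio between the two extremes is about 1.6.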

We saw similar answers when we asked what percentage of business processes are informed by real-time data: 33% of respondents said 25% or less, while 21% said 76% or more. (26-50% and 51-75% received 22% and 24% of responses, respectively.) Incorporating real-time data into business processes is a heavier lift than running a few reports before a management meeting, so it isn’t surprising that fewer people are making widespread use of real-time data. These responses also suggest that the industry is in the process of transformation, of deciding how to use real-time data. There are many possibilities: managing inventory, supply chains, and manufacturing processes; automating customer service; and reducing time spent on routine paperwork, to name a few. But we don’t yet see a clear direction.

The bane of data science has been the HIPPO: the “highest paid person’s opinion.” When the HIPPO is in the room, data is used primarily to justify decisions that have already been made. The questions we asked don’t tell us much about the presence of the HIPPO, but we have to wonder: Is that why 19% of the respondents say that data doesn’t have a big influence in corporate decision-making? Are the 31% who said that over 75% of management decisions are based on data being ironic or naive? We don’t know, and need to keep that ambiguity in mind. Data can’t be the final word in any decision; we can’t underestimate the importance of instinct and a gut understanding of business dynamics. Data is always historical and, as such, is often better at maintaining a status quo than at helping to build a future–though when used well, data can shine light on the status quo, and help you question it. Data that’s used solely to justify the HIPPO isn’t healthy. Our survey doesn’t say much about the influence of the HIPPO. That’s something you’ll need to ponder when considering your company’s technical health.

We’ve been tracking the democratization of data: the ability of staff members who aren’t data scientists, analysts, or something else with “data” in their title to access and use data in their jobs. Staff members need the ability to access and use data on their own, without going through intermediaries like database administrators (DBAs) and other custodians to generate reports and give them the data they need to work effectively. Self-service data is at the heart of the democratization process–and being data-driven isn’t meaningful if only a select priesthood has access to the data. Companies are slowly waking up to this reality. 26% of the respondents to our survey said that less than 20% of their company’s information workers had access to self-service query and analytics. That’s arguably a high percentage (and it was the most popular single answer), but we choose to see the glass as half (or three quarters) full: 74% said that more than 20% had access. (23% of the respondents said that 41% to 60% of their company’s information workers had self-service; 15% chose 61% to 80%; and 16% chose 81% to 100%.) No answer jumps out–but remember that, not so long ago, data was the property of actuaries, analysts, and DBAs. The walls between staff members and the data they need to do their job started to break down with the “data science” movement, but data scientists were still specialists and professionals. We’re still making the transition, but our survey shows that data is becoming more accessible, to more people, and we believe that’s healthy.
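The figures in this paragraph leave one answer bucket unstated. Assuming the answer choices were five 20-point ranges that together cover all respondents (an assumption on our part, and one that ignores rounding), the missing share can be recovered from the reported ones:

```python
# Percent of respondents per self-service access bucket, as reported.
reported = {
    "under 20%": 26,
    "41-60%": 23,
    "61-80%": 15,
    "81-100%": 16,
}

# Respondents saying more than 20% of workers have access.
more_than_20 = 100 - reported["under 20%"]

# Respondents saying more than 40% of workers have access.
more_than_40 = reported["41-60%"] + reported["61-80%"] + reported["81-100%"]

# Implied share for the unreported 21-40% bucket (assumes buckets sum to 100).
implied_21_to_40 = 100 - sum(reported.values())

print(more_than_20, more_than_40, implied_21_to_40)
```

On that assumption, about 20% of respondents chose the 21% to 40% range, and just over half (54%) said that more than 40% of their information workers have self-service access.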

Roughly one third (35%) of the respondents said that their organization used a data catalog. That seems low, but it isn’t surprising. While we like to tell each other how quickly technology changes, the fact is that real adoption is almost always slow. Data catalogs are a relatively new technology; their age is measured in years, not decades. They’re gradually being accepted.

We got a similar result when we asked about data governance tools. 58% of the respondents said they weren’t using anything (they chose “None of the above,” even though “the above” included an option for a write-in). SAP, IBM, SAS, and Informatica were the leading choices (21%, 14%, 12%, and 11% respectively; respondents could select multiple answers). Again, we expect adoption of data governance tools to be slow. Data has been the “wild west” of the technology world for years, with few restrictions on what any organization could do with the data it collected. That party is coming to an end, but nobody’s pretending that the hangover is pleasant. Like data catalogs (to which they’re closely related), governance tools are relatively new and being adopted gradually.

Looking at the bigger picture, we see that companies are grappling with the demands of self-service data. They are also facing increasing regulation governing the use of data. Both of these trends require tooling to support them. Catalogs help users find and maintain metadata that shows what data exists and how it should be used; governance tools track data provenance and ensure that data is used in accordance with company policies and regulations. Fifteen years ago, we frequently heard “save everything, and wring every bit of value you can out of your data.” In the 2020s, it’s hard to see that as a good, healthy attitude. An important part of technological health is a commitment to use data ethically and legally. We believe we see movement in that direction.

Over the coming months, we’ll investigate technical health in other areas (next up is Security). For data health, we can close with some observations:

Data can’t be the only factor in decision making; human judgment plays an important role. But using data simply to justify a human decision that’s already been made is also a mistake. Technical health means knowing when and how to use data effectively; it’s a continuum, not a choice. We believe that companies are on the path to understanding that.
Empowering staff to make their own data queries and perform their own analyses can help them become more productive and engaged. But this doesn’t happen on its own. People need to know what data is available to them, and what that data means. That’s the purpose of a data catalog. And the use of data has to comply with regulations and company policies; that’s the purpose of governance. Data catalogs and governance tools are making inroads, but they’ve only started. Technical health means empowering users with the tools they need to make effective, ethical, and legal use of data.

Healthy data improves processes, questions preconceived opinions, and shines a light on practices that are unfair or discriminatory. We don’t expect anyone to look at their company and say “our data practices deserve a gold star”; that misses the point. Maintaining a healthy relationship to data is an ongoing practice, and that practice is still developing. We are learning to make better decisions with data; we are learning to implement governance to use data ethically (to say nothing of legally). Data health means that you and your company are on the path, not that you’ve arrived. We’re all making the same journey.

SQL: The Universal Solvent for REST APIs

Jon Udell — Tue, 19 Jul 2022 11:16:39 +0000

Data scientists working in Python or R typically acquire data by way of REST APIs. Both environments provide libraries that help you make HTTP calls to REST endpoints, then transform JSON responses into dataframes. But that’s never as simple as we’d like. When you’re reading a lot of data from a REST API, you need to do it a page at a time, but pagination works differently from one API to the next. So does unpacking the resulting JSON structures. HTTP and JSON are low-level standards, and REST is a loosely-defined framework, but nothing guarantees absolute simplicity, never mind consistency across APIs.
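To make the pagination point concrete, here is a minimal Python sketch of two common styles: page numbers and cursor tokens. The two fetch functions are hypothetical in-memory stand-ins for real REST endpoints, and the field names (items, data, meta, next_token) are assumptions chosen for illustration. The point is only that each style needs its own loop and its own unpacking logic.

```python
from typing import Iterator, Optional

# Hypothetical data served by two made-up "APIs" (no network involved).
ITEMS = [{"id": i} for i in range(1, 8)]

def fetch_by_page(page: int, per_page: int = 3) -> dict:
    """Page-number style: the caller asks for page 1, 2, 3, ..."""
    start = (page - 1) * per_page
    return {"items": ITEMS[start:start + per_page]}

def fetch_by_cursor(cursor: Optional[str], limit: int = 3) -> dict:
    """Cursor style: each response carries a token for the next request."""
    start = int(cursor) if cursor else 0
    end = start + limit
    next_token = str(end) if end < len(ITEMS) else None
    return {"data": ITEMS[start:end], "meta": {"next_token": next_token}}

def read_all_pages() -> Iterator[dict]:
    """Keep requesting pages until an empty page comes back."""
    page = 1
    while True:
        batch = fetch_by_page(page)["items"]
        if not batch:
            return
        yield from batch
        page += 1

def read_all_cursor() -> Iterator[dict]:
    """Follow next_token until the API stops returning one."""
    cursor = None
    while True:
        response = fetch_by_cursor(cursor)
        yield from response["data"]
        cursor = response["meta"]["next_token"]
        if cursor is None:
            return

all_from_pages = list(read_all_pages())
all_from_cursor = list(read_all_cursor())
```

Both loops return the same seven records, but neither loop can be reused for the other API, and the JSON has to be unpacked differently in each case.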

What if there were a way of reading from APIs that abstracted all the low-level grunt work and worked the same way everywhere? Good news! That is exactly what Steampipe does. It’s a tool that translates REST API calls directly into SQL tables. Here are three examples of questions that you can ask and answer using Steampipe.

1. Twitter: What are recent tweets that mention PySpark?

Here’s a SQL query to ask that question:

select
  id,
  text
from
  twitter_search_recent
where
  query = 'pyspark'
order by
  created_at desc
limit 5;

Here’s the answer:

+---------------------+------------------------------------------------------------------------------------------------>
| id                  | text                                                                                           >
+---------------------+------------------------------------------------------------------------------------------------>
| 1526351943249154050 | @dump Tenho trabalhando bastante com Spark, mas especificamente o PySpark. Vale a pena usar um >
| 1526336147856687105 | RT @MitchellvRijkom: PySpark Tip ⚡                                                            >
|                     |                                                                                                >
|                     | When to use what StorageLevel for Cache / Persist?                                             >
|                     |                                                                                                >
|                     | StorageLevel decides how and where data should be s…                                           >
| 1526322757880848385 | Solve challenges and exceed expectations with a career as a AWS Pyspark Engineer. https://t.co/>
| 1526318637485010944 | RT @JosMiguelMoya1: #pyspark #spark #BigData curso completo de Python y Spark con PySpark      >
|                     |                                                                                                >
|                     | https://t.co/qf0gIvNmyx                                                                        >
| 1526318107228524545 | RT @money_personal: PySpark & AWS: Master Big Data With PySpark and AWS                    >
|                     | #ApacheSpark #AWSDatabases #BigData #PySpark #100DaysofCode                                    >
|                     | -> http…                                                                                    >
+---------------------+------------------------------------------------------------------------------------------------>

The table that’s being queried here, twitter_search_recent, receives the output from Twitter’s /2/tweets/search/recent endpoint and formulates it as a table with these columns. You don’t have to make an HTTP call to that API endpoint or unpack the results; you just write a SQL query that refers to the documented columns. One of those columns, query, is special: it encapsulates Twitter’s query syntax. Here, we are just looking for tweets that match PySpark, but we could as easily refine the query by pinning it to specific users, URLs, types (is:retweet, is:reply), properties (has:mentions, has:media), etc. That query syntax is the same no matter how you’re accessing the API: from Python, from R, or from Steampipe. It’s plenty to think about, and all you should really need to know when crafting queries to mine Twitter data.

2. GitHub: What are repositories that mention PySpark?

Here’s a SQL query to ask that question:

select 
  name, 
  owner_login, 
  stargazers_count 
from 
  github_search_repository 
where 
  query = 'pyspark' 
order by stargazers_count desc 
limit 10;

Here’s the answer:

+----------------------+-------------------+------------------+
| name                 | owner_login       | stargazers_count |
+----------------------+-------------------+------------------+
| SynapseML            | microsoft         | 3297             |
| spark-nlp            | JohnSnowLabs      | 2725             |
| incubator-linkis     | apache            | 2524             |
| ibis                 | ibis-project      | 1805             |
| spark-py-notebooks   | jadianes          | 1455             |
| petastorm            | uber              | 1423             |
| awesome-spark        | awesome-spark     | 1314             |
| sparkit-learn        | lensacom          | 1124             |
| sparkmagic           | jupyter-incubator | 1121             |
| data-algorithms-book | mahmoudparsian    | 1001             |
+----------------------+-------------------+------------------+

This looks very similar to the first example! In this case, the table that’s being queried, github_search_repository, receives the output from GitHub’s /search/repositories endpoint and formulates it as a table with these columns.

In both cases the Steampipe documentation not only shows you the schemas that govern the mapped tables, it also gives examples (Twitter, GitHub) of SQL queries that use the tables in various ways.

Note that these are just two of many available tables. The Twitter API is mapped to 7 tables, and the GitHub API is mapped to 41 tables.

3. Twitter + GitHub: What have owners of PySpark-related repositories tweeted lately?

To answer this question we need to consult two different APIs, then join their results. That’s even harder to do, in a consistent way, when you’re reasoning over REST payloads in Python or R. But this is the kind of thing SQL was born to do. Here’s one way to ask the question in SQL.

-- find pyspark repos
with github_repos as (
  select 
    name, 
    owner_login, 
    stargazers_count 
  from 
    github_search_repository 
  where 
    query = 'pyspark' and name ~ 'pyspark'
  order by stargazers_count desc 
  limit 50
),

-- find twitter handles of repo owners
github_users as (
  select
    u.login,
    u.twitter_username
  from
    github_user u
  join
    github_repos r
  on
    r.owner_login = u.login
  where
    u.twitter_username is not null
),

-- find corresponding twitter users
twitter_userids as (
  select
    id
  from
    twitter_user t
  join
    github_users g
  on
    t.username = g.twitter_username
)

-- find tweets from those users
select
  t.author->>'username' as twitter_user,
  'https://twitter.com/' || (t.author->>'username') || '/status/' || t.id as url,
  t.text
from
  twitter_user_tweet t
join
  twitter_userids u
on
  t.user_id = u.id
where
  t.created_at > now()::date - interval '1 week'
order by
  t.author
limit 5

Here is the answer:

+----------------+---------------------------------------------------------------+------------------------------------->
| twitter_user   | url                                                           | text                                >
+----------------+---------------------------------------------------------------+------------------------------------->
| idealoTech     | https://twitter.com/idealoTech/status/1524688985649516544     | Are you able to find creative soluti>
|                |                                                               |                                     >
|                |                                                               | Join our @codility Order #API Challe>
|                |                                                               |                                     >
|                |                                                               | #idealolife #codility #php          >
| idealoTech     | https://twitter.com/idealoTech/status/1526127469706854403     | Our #ProductDiscovery team at idealo>
|                |                                                               |                                     >
|                |                                                               | Think you can solve it? ?          >
|                |                                                               | ➡  https://t.co/ELfUfp94vB https://t>
| ioannides_alex | https://twitter.com/ioannides_alex/status/1525049398811574272 | RT @scikit_learn: scikit-learn 1.1 i>
|                |                                                               | What's new? You can check the releas>
|                |                                                               |                                     >
|                |                                                               | pip install -U…                     >
| andfanilo      | https://twitter.com/andfanilo/status/1524999923665711104      | @edelynn_belle Thanks! Sometimes it >
| andfanilo      | https://twitter.com/andfanilo/status/1523676489081712640      | @juliafmorgado Good luck on the reco>
|                |                                                               |                                     >
|                |                                                               | My advice: power through it + a dead>
|                |                                                               |                                     >
|                |                                                               | I hated my first few short videos bu>
|                |                                                               |                                     >
|                |                                                               | Looking forward to the video ?

When APIs frictionlessly become tables, you can devote your full attention to reasoning over the abstractions represented by those APIs. Larry Wall, the creator of Perl, famously said: “Easy things should be easy, hard things should be possible.” The first two examples are things that should be, and are, easy: each is just 10 lines of simple, straight-ahead SQL that requires no wizardry at all.

The third example is a harder thing. It would be hard in any programming language. But SQL makes it possible in several nice ways. The solution is made of concise stanzas (CTEs, Common Table Expressions) that form a pipeline. Each phase of the pipeline handles one clearly-defined piece of the problem. You can validate the output of each phase before proceeding to the next. And you can do all this with the most mature and widely-used grammar for selection, filtering, and recombination of data.
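The idea of validating each phase can be tried without Steampipe. The following Python sketch uses the standard sqlite3 module and three small local tables with made-up rows. The table names, the data, and the use of like instead of the Postgres ~ operator are all assumptions for illustration. Each stage is built by appending one more CTE, so the output of any stage can be inspected before the next one is added.

```python
import sqlite3

# Local stand-in tables with hypothetical rows (not real API data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table repos (name text, owner_login text, stargazers_count integer);
create table users (login text, twitter_username text);
create table tweets (username text, text text);
insert into repos values ('pyspark-examples', 'alice', 900),
                         ('pyspark-tools', 'bob', 300),
                         ('other-project', 'carol', 5000);
insert into users values ('alice', 'alice_tw'), ('bob', null), ('carol', 'carol_tw');
insert into tweets values ('alice_tw', 'hello spark'), ('carol_tw', 'unrelated');
""")

# Stage 1: find the repos of interest.
STAGE_1 = """
with pyspark_repos as (
  select name, owner_login from repos where name like '%pyspark%'
)
"""

# Stage 2: append a CTE that finds owners who list a Twitter handle.
STAGE_2 = STAGE_1 + """,
owners as (
  select u.login, u.twitter_username
  from users u join pyspark_repos r on r.owner_login = u.login
  where u.twitter_username is not null
)
"""

def run(sql):
    return conn.execute(sql).fetchall()

# Inspect each stage on its own before building on it.
stage1_rows = run(STAGE_1 + "select * from pyspark_repos order by name")
stage2_rows = run(STAGE_2 + "select * from owners")

# Final stage: tweets from those owners.
final_rows = run(STAGE_2 + """
select t.username, t.text
from tweets t join owners o on t.username = o.twitter_username
""")
```

If stage 2 unexpectedly returned no rows, you would know the problem lies in the join or the null filter, not in the final select.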

Do I have to use SQL?

No! If you like the idea of mapping APIs to tables, but you would rather reason over those tables in Python or R dataframes, then Steampipe can oblige. Under the covers it’s Postgres, enhanced with foreign data wrappers that handle the API-to-table transformation. Anything that can connect to Postgres can connect to Steampipe, including SQL drivers like Python’s psycopg2 and R’s RPostgres as well as business-intelligence tools like Metabase, Tableau, and PowerBI. So you can use Steampipe to frictionlessly consume APIs into dataframes, then reason over the data in Python or R.
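As a rough sketch of what that looks like from Python, the helper below runs a query on any DB-API 2.0 connection and returns rows as dictionaries, which is a convenient shape for building a dataframe. The helper name query_to_records is hypothetical. To keep the example runnable anywhere, it uses an in-memory SQLite table as a stand-in; with Steampipe you would pass a psycopg2 connection instead. The connection settings shown in the comment are assumptions, so check the details that Steampipe prints when its service starts.

```python
import sqlite3

def query_to_records(conn, sql):
    """Run sql on any DB-API 2.0 connection and return a list of dicts."""
    cur = conn.cursor()
    try:
        cur.execute(sql)
        columns = [col[0] for col in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]
    finally:
        cur.close()

# Stand-in for a Steampipe connection. With Steampipe you would pass
# something like the following instead (assumed settings, not verified here):
#   psycopg2.connect(host="localhost", port=9193, dbname="steampipe",
#                    user="steampipe", password="...")
demo = sqlite3.connect(":memory:")
demo.execute(
    "create table github_search_repository (name text, stargazers_count integer)"
)
demo.execute(
    "insert into github_search_repository values "
    "('SynapseML', 3297), ('spark-nlp', 2725)"
)

records = query_to_records(
    demo,
    "select name, stargazers_count from github_search_repository "
    "order by stargazers_count desc",
)
```

A list of dictionaries like records can be handed to pandas.DataFrame(records) if you prefer to continue in a dataframe.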

But if you haven’t used SQL in this way before, it’s worth a look. Consider this comparison of SQL to Pandas from How to rewrite your SQL queries in Pandas.

SQL	Pandas
select * from airports	airports
select * from airports limit 3	airports.head(3)
select id from airports where ident = 'KLAX'	airports[airports.ident == 'KLAX'].id
select distinct type from airports	airports.type.unique()
select * from airports where iso_region = 'US-CA' and type = 'seaplane_base'	airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')]
select ident, name, municipality from airports where iso_region = 'US-CA' and type = 'large_airport'	airports[(airports.iso_region == 'US-CA') & (airports.type == 'large_airport')][['ident', 'name', 'municipality']]

We can argue the merits of one style versus the other, but there’s no question that SQL is the most universal and widely-implemented way to express these operations on data. So no, you don’t have to use SQL to its fullest potential in order to benefit from Steampipe. But you might find that you want to.

Quantum Computing without the Hype

Mike Loukides — Tue, 10 May 2022 11:45:05 +0000

Several weeks ago, I had a great conversation with Sebastian Hassinger about the state of quantum computing. It’s exciting–but also, not what a lot of people are expecting.

I’ve seen articles in the trade press telling people to invest in quantum computing now or they’ll be hopelessly behind. That’s silly. There are too many people in the world who think that a quantum computer is just a fast mainframe. It isn’t; quantum programming is completely different, and right now, the number of algorithms we know that will work on quantum computers is very small. You can count them on your fingers and toes. While it’s probably important to prepare for quantum computers that can decrypt current cryptographic codes, those computers won’t be around for 10-20 years. While there is still debate on how many physical qubits will be needed for error correction, and even on the meaning of a “logical” (error-corrected) qubit, the most common estimates are that it will require on the order of 1,000 error-corrected qubits to break current encryption systems, and that it will take 1,000 physical qubits to make one error-corrected qubit. So we’ll need on the order of 1 million (1,000 × 1,000) physical qubits, and current quantum computers are all in the area of 100 qubits. Figuring out how to scale our current quantum computers by 4 orders of magnitude (from roughly 100 qubits to roughly 1,000,000) may well be the biggest problem facing researchers, and there’s no solution in sight.

So what can quantum computers do now that’s interesting? First, they are excellent tools for simulating quantum behavior: the behavior of subatomic particles and atoms that make up everything from semiconductors to bridges to proteins. Most, if not all, modeling in these areas is based on numerical methods–and modern digital computers are great at that. But it’s time to think again about non-numerical methods: can a quantum computer simulate directly what happens when two atoms interact? Can it figure out what kind of molecules will be formed, and what their shapes will be? This is the next step forward in quantum computing, and while it’s still research, it’s a significant way forward. We live in a quantum world. We can’t observe quantum behavior directly, but it’s what makes your laptop work and your bridges stay up. If we can model that behavior directly with quantum computers, rather than through numeric analysis, we’ll make a huge step forward towards finding new kinds of materials, new treatments for disease, and more. In a way, it’s like the difference between analog and digital computers. Any engineer knows that digital computers spend a lot of time finding approximate numeric solutions to complicated differential equations. But until digital computers got sufficiently large and fast, the behavior of those systems could be modeled directly on analog computers. Perhaps the earliest known examples of analog computers are Stonehenge and the Antikythera mechanism, both of which were used to predict astronomical positions. Thousands of years before digital computers existed, these analog computers modeled the behavior of the cosmos, solving equations that their makers couldn’t have understood–and that we now solve numerically on digital computers.

Recently, researchers have developed a standardized control plane that should be able to work with all kinds of quantum devices. The design of the control plane, including software, is all open source. This should greatly decrease the cost of experimentation, allowing researchers to focus on the quantum devices themselves, instead of designing the circuitry needed to manage the qubits. It’s not unlike the dashboard of a car: relatively early in automotive history, we developed a fairly standard set of tools for displaying data and controlling the machinery. If we hadn’t, the development of automobiles would have been set back by decades: every automaker would need to design its own controls, and you’d need fairly extensive training on your specific car before you could drive it. Programming languages for quantum devices also need to standardize; fortunately, there has already been a lot of work in that direction. Open source development kits provide libraries that can be called from Python to perform quantum operations (Qiskit, Braket, and Cirq are some examples), and OpenQASM is an open source “quantum assembly language” that lets programmers write (virtual) machine-level code that can be mapped to instructions on a physical machine.

Another approach to simulating quantum behavior won’t help probe quantum behavior, but might help researchers to develop algorithms for numerical computing. P-bits, or probabilistic bits, behave probabilistically but don’t depend on quantum physics: they’re traditional electronics that work at room temperature. P-bits have some of the behavior of qubits, but they’re much easier to build; the developers call them “poor man’s qubits.” Will p-bits make it easier to develop a quantum future? Possibly.

It’s important not to get over-excited about quantum computing. The best way to avoid a “trough of disillusionment” is to be realistic about your expectations in the first place. Most of what computers currently do will remain unchanged. There will be some breakthroughs in areas like cryptography, search, and a few other areas where we’ve developed algorithms. Right now, “preparing for quantum computing” means evaluating your cryptographic infrastructure. Given that infrastructure changes are difficult, expensive, and slow, it makes sense to prepare for quantum-safe cryptography now. (Quantum-safe cryptography is cryptography that can’t be broken by quantum computers–it does not require quantum computers.) Quantum computers may still be 20 years in the future, but infrastructure upgrades could easily take that long.

Practical (numeric) quantum computing at significant scale could be 10 to 20 years away, but a few breakthroughs could shorten that time drastically. In the meantime, a lot of work still needs to be done on discovering quantum algorithms. And a lot of important work can already be done by using quantum computers as tools for investigating quantum behavior. It is an exciting time; it’s just important to be excited by the right things, and not misled by the hype.

Recommendations for all of us

Chris Butler — Thu, 10 Mar 2022 14:07:27 +0000

If you live in a household with a communal device like an Amazon Echo or Google Home Hub, you probably use it to play music. If you live with other people, you may find that over time, the Spotify or Pandora algorithm seems not to know you as well. You’ll find songs creeping into your playlists that you would never have chosen for yourself. The cause is often obvious: I’d see a whole playlist devoted to Disney musicals or Minecraft fan songs. I don’t listen to this music, but my children do, using the shared device in the kitchen. And that shared device only knows about a single user, and that user happens to be me.

More recently, many people, myself included, found that the end-of-year wrap-up playlists Spotify created for them didn’t quite fit:

Source: https://twitter.com/chrizbot/status/1466436036771389463

This kind of a mismatch and narrowing to one person is an identity issue that I’ve identified in previous articles about communal computing. Most home computing devices don’t understand all of the identities (and pseudo-identities) of the people who are using the devices. The services then extend the behavior collected through these shared experiences to recommend music for personal use. In short, these devices are communal devices: they’re designed to be used by groups of people, and aren’t dedicated to an individual. But they are still based on a single-user model, in which the device is associated with (and collects data about) a single identity.

These services should be able to do a better job of recommending content for groups of people. Platforms like Netflix and Spotify have tried to deal with this problem, but it is difficult. I’d like to take you through some of the basics for group recommendation services, what is being tried today, and where we should go in the future.

Common group recommendation methods

After seeing these problems with communal identities, I became curious about how other people have solved group recommendation services so far. Recommendation services for individuals succeed if they lead to further engagement. Engagement may take different forms, based on the service type:

Video recommendations – watching an entire show or movie, subscribing to the channel, watching the next episode
Commerce recommendations – buying the item, rating it
Music recommendations – listening to a song fully, adding to a playlist, liking

Collaborative filtering (deep dive in Programming Collective Intelligence) is the most common approach for doing individual recommendations. It looks at who I overlap with in taste and then recommends items that I might not have tried from other people’s lists. This won’t work for group recommendations because in a group, you can’t tell which behavior (e.g., listening or liking a song) should be attributed to which person. Collaborative filtering only works when the behaviors can all be attributed to a single person.
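A minimal sketch of user-based collaborative filtering, assuming each person’s behavior is reduced to a set of liked items and that taste overlap is measured with Jaccard similarity. The data, the function names, and the choice of similarity measure are all illustrative assumptions; production systems use far richer signals.

```python
from collections import defaultdict

# Hypothetical listening data: user -> set of liked items.
likes = {
    "me":    {"song_a", "song_b", "song_c"},
    "ana":   {"song_a", "song_b", "song_d"},
    "ben":   {"song_c", "song_e"},
    "chris": {"song_f"},
}

def jaccard(a, b):
    """Overlap between two sets of likes, from 0.0 (none) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, likes, top_n=3):
    """Score items the user hasn't tried by the similarity of the people who like them."""
    mine = likes[user]
    scores = defaultdict(float)
    for other, theirs in likes.items():
        if other == user:
            continue
        sim = jaccard(mine, theirs)
        if sim == 0:
            continue  # no shared taste, so this person's list carries no weight
        for item in theirs - mine:
            scores[item] += sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

suggestions = recommend("me", likes)
```

Note that the likes dictionary is keyed by person. On a shared device, every play lands under one key, which is exactly why the attribution problem described above undermines this approach.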

Group recommendation services build on top of these individualized concepts. The most common approach is to look at each individual’s preferences and combine them in some way for the group. Two key papers discussing how to combine individual preferences describe PolyLens, a movie recommendation service for groups, and CATS, an approach to collaborative filtering for group recommendations. A paper on ResearchGate summarized research on group recommendations back in 2007.

According to the PolyLens paper, group recommendation services should “create a ‘pseudo-user’ that represents the group’s tastes, and to produce recommendations for the pseudo-user.” There could be issues about imbalances of data if some members of the group provide more behavior or preference information than others. You don’t want the group’s preferences to be dominated by a very active minority.

An alternative to this, again from the PolyLens paper, is to “generate recommendation lists for each group member and merge the lists.” It’s easier for these services to explain why any item is on the list, because it’s possible to show how many members of the group liked a particular item that was recommended. Creating a single pseudo-user for the group might obscure the preferences of individual members.
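The two strategies can be contrasted in a short Python sketch. This is one possible reading, not the PolyLens implementation: the ratings are made up, the pseudo-user is built by simple averaging, and the merged list ranks items by how many members had them in their personal top picks. All of those choices are assumptions for illustration.

```python
# Hypothetical predicted ratings (1-5) for each group member.
ratings = {
    "ana": {"movie_x": 5, "movie_y": 2, "movie_z": 4},
    "ben": {"movie_x": 3, "movie_y": 5, "movie_z": 4},
    "cy":  {"movie_x": 4, "movie_y": 1, "movie_z": 4},
}

def pseudo_user(ratings):
    """Strategy 1: average everyone's ratings into a single group profile."""
    totals, counts = {}, {}
    for prefs in ratings.values():
        for item, score in prefs.items():
            totals[item] = totals.get(item, 0) + score
            counts[item] = counts.get(item, 0) + 1
    return {item: totals[item] / counts[item] for item in totals}

def merge_lists(ratings, top_n=2):
    """Strategy 2: take each member's top_n items and record who contributed each."""
    supporters = {}
    for member, prefs in ratings.items():
        top = sorted(prefs, key=lambda i: (-prefs[i], i))[:top_n]
        for item in top:
            supporters.setdefault(item, []).append(member)
    # Rank by number of supporters, then by name for a stable order.
    return sorted(supporters.items(), key=lambda kv: (-len(kv[1]), kv[0]))

group_profile = pseudo_user(ratings)
merged = merge_lists(ratings)
```

The merged list keeps the names of the members behind each item, which is what makes it easy to explain. The averaged profile gives movie_x and movie_z the same score, hiding the fact that the members disagree about movie_x and agree about movie_z.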

The criteria for the success of a group recommendation service are similar to the criteria for the success of individual recommendation services: are songs and movies played in their entirety? Are they added to playlists? However, group recommendations must also take into account group dynamics. Is the algorithm fair to all members of the group, or do a few members dominate its recommendations? Do its recommendations cause “misery” to some group members (i.e., are there some recommendations that most members always listen to and like, but that some always skip and strongly dislike)?

There are some important questions left for implementers:

How do people join a group?
Should each individual’s history be private?
How do issues like privacy impact explainability?
Is the current use to discover something new or to revisit something that people have liked previously (e.g. find out about a new movie that no one has watched or rewatch a movie the whole family has seen together since it is easy)?

So far, there is a lot left to understand about group recommendation services. Let’s talk about a few key cases for Netflix, Spotify, and Amazon first.

Netflix avoiding the issue with profiles, or is it?

Back when Netflix was primarily a DVD service (2004), they launched profiles to allow different people in the same household to have different queues of DVDs in the same account. Netflix eventually extended this practice to online streaming. In 2014, they launched profiles on their streaming service, which asked the question “who’s watching?” on the launch screen. While multiple queues for DVDs and streaming profiles try to address similar problems, they don’t end up solving group recommendations. In particular, per-person streaming profiles lead to two key problems:

When a group wants to watch a movie together, one of the group’s profiles needs to be selected. If there are children present, a kids’ profile will probably be selected. However, that profile doesn’t take into account the preferences of adults who are present.
When someone is visiting the house, say a guest or a babysitter, they will most likely end up choosing a random profile. This means that the visitor’s behavioral data will be added to some household member’s profile, which could skew their recommendations.

How could Netflix provide better selection and recommendation streams when there are multiple people watching together? Netflix talked about this question in a blog post from 2012, but it isn’t clear to customers what they are doing:

That is why when you see your Top10, you are likely to discover items for dad, mom, the kids, or the whole family. Even for a single person household we want to appeal to your range of interests and moods. To achieve this, in many parts of our system we are not only optimizing for accuracy, but also for diversity.

Netflix was early to consider the various people using their services in a household, but they have to go further before meeting the requirements of communal use. If diversity is rewarded, how do they know it is working for everyone “in the room” even though they don’t collect that data? As you expand who might be watching, how would they know when a show or movie is inappropriate for the audience?

Amazon merges everyone into the main account

When people live together in a household, it is common for one person to arrange most of the repairs or purchases. When using Amazon, that person will effectively get recommendations for the entire household. Amazon focuses on increasing the number of purchases made by that person, without understanding anything about the larger group. They will offer subscriptions to items that might be consumed by a whole household, but mistake those for the purchases of an individual.

The result is that the person who wanted the item will never see additional recommendations they may have liked if they aren’t the main account holder–and the main account holder might ignore those recommendations because they don’t care. I wonder if Amazon changes recommendations to individual accounts that are part of the same Prime membership; this might address some of this mismatch.

The way that Amazon ties these accounts together is still subject to key questions that will help create the right recommendations for a household. How might Amazon understand that purchases such as food and other perishables are for the household, rather than an individual? What about purchases that are gifts for others in the household?

Spotify is leading the charge with group playlists

Spotify has created group subscription packages called Duo (for couples) and Premium Family (for more than two people). These packages not only simplify the billing relationship with Spotify; they also provide playlists that consider everyone in the subscription.

The shared playlist is the union of the accounts on the same subscription. This creates a playlist of up to 50 songs that all accounts can see and play. There are some controls that allow account owners to flag songs that might not be appropriate for everyone on the subscription. Spotify provides a lot of information about how they construct the Blend playlist in a recent blog post. In particular, they weighed whether they should try to reduce misery or maximize joy:

“Minimize the misery” is valuing democratic and coherent attributes over relevance. “Maximize the joy” values relevance over democratic and coherent attributes. Our solution is more about maximizing the joy, where we try to select the songs that are most personally relevant to a user. This decision was made based on feedback from employees and our data curation team.

Reducing misery would most likely provide better background music (music that is not unpleasant to everyone in the group), but is less likely to help people discover new music from each other.

Spotify was also concerned about explainability: they thought people would want to know why a song was included in a blended playlist. They solved this problem, at least partly, by showing the picture of the person from whose playlists the song came.

These multi-person subscriptions and group playlists solve some problems, but they still struggle to answer certain questions we should ask about group recommendation services. What happens if two people have very little overlapping interest? How do we detect when someone hates certain music but is just OK with others? How do they discover new music together?

Reconsidering the communal experience based on norms

Most of the research into group recommendation services has been tweaking how people implicitly and explicitly rate items to be combined into a shared feed. These methods haven’t considered how people might self-select into a household or join a community that wants to have group recommendations.

For example, deciding what to watch on a TV may take a few steps:

Who is in the room? Only adults or kids too? If there are kids present, there should be restrictions based on age.
What time of day is it? Are we taking a midday break or relaxing after a hard day? We may opt for educational shows for kids during the day and comedy for adults at night.
Did we just watch something from which an algorithm can infer what we want to watch next? This will lead to the next episode in a series.
Who hasn’t gotten a turn to watch something yet? Is there anyone in the household whose highest-rated songs haven’t been played? This will lead to turn taking.
And more…

As you can see, contexts, norms, and history are all tied up in the way people decide what to watch next as a group. PolyLens discussed this in their paper, but didn’t act on it:

The social value functions for group recommendations can vary substantially. Group happiness may be the average happiness of the members, the happiness of the most happy member, or the happiness of the least happy member (i.e., we’re all miserable if one of us is unhappy). Other factors can be included. A social value function could weigh the opinion of expert members more highly, or could strive for long-term fairness by giving greater weight to people who “lost out” in previous recommendations.
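To make these social value functions concrete, here is a minimal Python sketch. The household members, titles, predicted ratings, and fairness weights are all invented for illustration; nothing here comes from PolyLens or from any production system.

```python
from statistics import mean

# Hypothetical predicted ratings (1-5) for three made-up household members.
candidates = {
    "documentary": {"ana": 3.5, "ben": 3.5, "cy": 3.5},
    "action": {"ana": 5, "ben": 4, "cy": 1},
    "comedy": {"ana": 4, "ben": 3, "cy": 4},
}

def average_happiness(ratings):
    """Group happiness is the average happiness of the members."""
    return mean(ratings.values())

def most_happy(ratings):
    """Group happiness is the happiness of the most happy member."""
    return max(ratings.values())

def least_happy(ratings):
    """Group happiness is the happiness of the least happy member."""
    return min(ratings.values())

def weighted_happiness(ratings, weights):
    """Weighted average; a larger weight gives a member more say."""
    total = sum(weights[member] for member in ratings)
    return sum(ratings[member] * weights[member] for member in ratings) / total

def pick(candidates, value_fn):
    """Return the title with the highest group value under value_fn."""
    return max(candidates, key=lambda title: value_fn(candidates[title]))

# Assumed weights: ana "lost out" on recent picks, so her opinion counts more.
fairness_weights = {"ana": 3, "ben": 1, "cy": 1}

def fair_happiness(ratings):
    return weighted_happiness(ratings, fairness_weights)

for name, fn in [
    ("average", average_happiness),
    ("most happy", most_happy),
    ("least happy", least_happy),
    ("fairness-weighted", fair_happiness),
]:
    print(f"{name}: {pick(candidates, fn)}")
```

With these made-up numbers, the average picks the comedy, the most-happy rule picks the action movie, the least-happy rule picks the documentary, and the fairness weights tip the choice back to the action movie. The same household gets a different recommendation depending on which norm the service assumes.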

Getting this highly contextual information is very hard. It may not be possible to collect much more than “who is watching” as Netflix does today. If that is the case, we may want to reduce all of the context to the location and time. The TV room at night will have a different behavioral history than the kitchen on a Sunday morning.

One way to measure the success of a group recommendation service is how much browsing is required before a decision is made. If we can get someone watching or listening to something with less negotiation, that could mean the group recommendation service is doing its job.

With the proliferation of personal devices, people can be present to “watch” with everyone else but not be actively viewing. They could be playing a game, messaging with someone else, or simply watching something else on their device. This flexibility raises the question of what “watching together” means, but also lowers the concern that we need to get group recommendations right all the time. It’s easy enough for someone to do something else. However, the reverse isn’t true. The biggest mistake we can make is to take highly contextual behavior gathered from a shared environment and apply it to my personal recommendations.

Contextual integrity and privacy of my behavior

When we start mixing information from multiple people in a group, it’s possible that some will feel that their privacy has been violated. Using some of the framework of Contextual Integrity, we need to look at the norms that people expect. Some people might be embarrassed if the music they enjoy privately was suddenly shown to everyone in a group or household. Is it OK to share explicit music with the household even if everyone is OK with explicit music in general?

People already build very complex mental models about how services like Spotify work and sometimes personify them as “folk theories.” The expectations will most likely change if group recommendation services are brought front and center. Services like Spotify will appear to be more like a social network if they don’t bury who is currently logged in behind a small profile picture in the corner; they should show everyone who is being considered for the group recommendations at that moment.

Privacy laws and regulations are becoming more patchwork not only worldwide (China has recently created regulation of content recommendation services) but even within states of the US. Collecting any data without appropriate disclosure and permission may be problematic. The fuel of recommendation services, including group recommendation services, is behavioral data about people that will fall under these laws and regulations. You should be considering what is best for the household over what is best for your organization.

The dream of the whole family

Today there are various efforts for improving recommendations to people living in households. These efforts miss the mark by not considering all of the people who could be watching, listening, or consuming the goods. This means that people do not get what they really want, and that companies get less engagement or sales than they would like.

The key to fixing these issues is to do a better job of understanding who is in the room, rather than making assumptions that reduce all the group members down to a single account. To do so will require user experience changes that bring the household community front and center.

If you are considering how you build these services, start with the expectations of the people in the environment, rather than forcing the single user model on people. When you do, you will provide something great for everyone who is in the room: a way to enjoy something together.

Epstein Barr and the Cause of Cause

Mike Loukides — Tue, 08 Mar 2022 12:17:00 +0000

One of the most intriguing news stories of the new year claimed that the Epstein-Barr virus (EBV) is the “cause” of Multiple Sclerosis (MS), and suggested that antiviral medications or vaccinations for Epstein-Barr could eliminate MS.

I am not an MD or an epidemiologist. But I do think this article forces us to think about the meaning of “cause.” Although Epstein-Barr isn’t a familiar name, it’s extremely common; a good estimate is that 95% of the population is infected with it. It’s a variant of Herpes; if you’ve ever had mononucleosis, you’ve had it; most infections are asymptomatic. We hear much more about MS; I’ve had friends who have died from it. But MS is much less common: about 0.036% of the population has it (35.9 per 100,000).

We know that causation isn’t a one-size-fits-all thing: if X happens, then Y always happens. Lots of people smoke; we know that smoking causes lung cancer; but many people who smoke don’t get lung cancer. We’re fine with that; the causal connection has been painstakingly documented in great detail, in part because the tobacco industry went to such great lengths to spread misinformation.

But what does it mean to say that a virus that infects almost everyone causes a disease that affects very few people? The researchers appear to have done their job well. They studied 10 million people in the US military. 5 percent of those were negative for Epstein-Barr at the start of their service. 955 of that group were eventually diagnosed with MS, and had been infected with EBV prior to their MS diagnosis, indicating a risk factor 32 times higher than for those without EBV.

It is certainly fair to say that Epstein-Barr is implicated in MS, or that it contributes to MS, or some other phrase (that could not unreasonably be called “weasel words”). Is there another trigger that only has an effect when EBV is already present? Or is EBV the sole cause of MS, a cause that just doesn’t take effect in the vast majority of people?

This is where we have to think very carefully about causality, because as important as this research is, it seems like something is missing. An omitted variable, perhaps a genetic predisposition? Some other triggering condition, perhaps environmental? Cigarettes were clearly a “smoking gun”: 10 to 20 percent of smokers develop lung cancer (to say nothing of other diseases). EBV may also be a smoking gun, but one that only goes off rarely.

If there are no other factors, we’re justified in using the word “causes.” But it’s hardly satisfying—and that’s where the more precise language of causal inference runs afoul of human language. Mathematical language is more useful: Perhaps EBV is “necessary” for MS (i.e., EBV is required; you can’t get MS without it), but clearly not “sufficient” (EBV doesn’t necessarily lead to MS). Although once again, the precision of mathematics may be too much.
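A rough back-of-the-envelope calculation shows how far apart “necessary” and “sufficient” are here. The sketch below uses only the two figures quoted earlier (95% infected, 35.9 MS cases per 100,000) and assumes, purely for the sake of argument, that EBV is strictly necessary, so that P(EBV | MS) = 1. The function name is made up, and treating both figures as describing the same population is a simplification.

```python
def p_effect_given_cause(p_effect, p_cause, p_cause_given_effect=1.0):
    """Bayes' rule: P(effect | cause) = P(cause | effect) * P(effect) / P(cause)."""
    return p_cause_given_effect * p_effect / p_cause

P_MS = 35.9 / 100_000  # MS prevalence quoted in the article
P_EBV = 0.95           # share of the population infected with EBV

# Assumption: EBV is strictly necessary for MS, so P(EBV | MS) = 1.
p_ms_given_ebv = p_effect_given_cause(P_MS, P_EBV)

print(f"P(MS | EBV) is about {p_ms_given_ebv * 100_000:.1f} per 100,000")
```

Under these assumptions, roughly 38 out of every 100,000 people infected with EBV have MS: a condition that is always present when the disease occurs, and almost never followed by it.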

Biological systems aren’t necessarily mathematical, and it is possible that there is no “sufficient” condition; EBV just leads to MS in an extraordinarily small number of instances. In turn, we have to take this into account in decision-making. Does it make sense to develop a vaccine against a rare (albeit tragic, disabling, and inevitably fatal) disease? If EBV is implicated in other diseases, possibly. However, vaccines aren’t without risk (or expense), and even though the risk is very small (as it is for all the vaccines we use today), it’s not clear that it makes sense to take that risk for a disease that very few people get. How do you trade off a small risk against a very small reward? Given the anti-vax hysteria around COVID, requiring children to be vaccinated for a rare disease might not be poor public health policy; it might be the end of public health policy.

More generally: how do you build software systems that predict rare events? This is another version of the same problem, and unfortunately, the policy decision we are least likely to make is not to create such software. The abuse of such systems is a clear and present danger: for example, AI systems that pretend to predict “criminal behavior” on the basis of everything from crime data to facial images are already being developed. Many are already in use, and in high demand from law enforcement agencies. They will certainly generate far more false positives than true positives, stigmatizing thousands (if not millions) of people in the process. Even with carefully collected, unbiased data (which doesn’t exist), and assuming some kind of causal connection between past history, physical appearance, and future criminal behavior (as in the discredited 19th-century pseudoscience of physiognomy), it is very difficult, if not impossible, to reason from a relatively common cause to a very rare effect. Most people don’t become criminals, regardless of their physical appearance. Deciding a priori who will become one can only become an exercise in applied racism and bias.
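The claim about false positives follows from simple base-rate arithmetic. The sketch below is illustrative only: the population size, the one-in-a-thousand base rate, and the detection and false-alarm rates are invented numbers, chosen to be generous to the hypothetical predictor, and do not describe any real system.

```python
def screening_outcomes(population, base_rate, sensitivity, false_positive_rate):
    """Return (true positives, false positives, precision) for a screening test."""
    actual_positives = population * base_rate
    actual_negatives = population - actual_positives
    true_positives = actual_positives * sensitivity
    false_positives = actual_negatives * false_positive_rate
    precision = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, precision

# All figures below are hypothetical.
tp, fp, precision = screening_outcomes(
    population=1_000_000,
    base_rate=0.001,           # the event occurs for 1 person in 1,000
    sensitivity=0.99,          # the system flags 99% of real cases
    false_positive_rate=0.05,  # and wrongly flags 5% of everyone else
)

print(f"correctly flagged: {tp:,.0f}")
print(f"wrongly flagged:   {fp:,.0f}")
print(f"share of flags that are correct: {precision:.1%}")
```

Even with a predictor this accurate, about 50 people are wrongly flagged for every person flagged correctly, because the people who will never exhibit the behavior vastly outnumber those who will.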

Virology aside, the Epstein-Barr virus has one thing to teach us. How do we think about a cause that rarely causes anything? That is a question we need to answer.