GMD Bioinformatics

Managing Expectations with the Medallion Architecture

Giovanni Dall'Olio — Tue, 08 Apr 2025 13:38:20 +0000

(Original article on Medium)

The Medallion Architecture is a framework from Databricks that helps data teams manage expectations.

The Medallion Architecture makes it possible for data teams to balance data urgency against data quality.

Close your eyes and imagine, for a moment, that you are the Head of Data of a biotech company. Your position is one of the most important in the company, especially in the data science and AI space. If the data is garbage, the models will be garbage. If the data is good, the models will be good, and the company will be successful. Your team can make the difference between the two outcomes.

However, managing a data team is not an easy task. If things go well, you’ll be overwhelmed by users requesting new datasets, but you will not have the resources to deliver them all. If things do not go well, users will start downloading things on their laptops, and ignoring the data pipelines you are building, bypassing all your efforts.

How do you deal with this situation as a data leader?

In my opinion, one of the first steps is to understand that two primal forces govern the needs of data users. The first force is the need to make data available now, without making users wait for months while your team develops a data pipeline. If you spend too much time developing the perfect pipeline, your users will start going around you, downloading what they need on their laptops and forgetting about the data team. The second force is the need for perfectly clean data and harmonized metadata. This is easier said than done, and it requires time and careful planning.

The Medallion Architecture from Databricks is a way to manage these two forces. Essentially, data is distributed across three catalogs, called Bronze, Silver and Gold. The Bronze catalog is the messy place where, by definition, data is made available quickly but in raw form, without cleaning or curation. The Silver catalog is a midpoint between Bronze and Gold, where data is available in a reasonable form, after a reasonable investment of time from your team. The Gold catalog, instead, is the tidy place where clean and harmonized data products live, defined by user stories and use cases, and developed after months of investment.

The official definition of the Medallion Architecture from Databricks.

Let’s look at an example.

A new paper is published in Nature, presenting a new sequencing technique that is different from what has been seen before. Your data scientists are very excited by this and want to try the data as soon as possible. However, as the data team leader, you know it will take months to develop a perfect pipeline for storing this data into your data lake. What do you do?

The first step is to download the data from the paper and put it in a new folder in Bronze. You may add some minimal metadata to track where the data was downloaded from, but not much more than that. This way, the data scientists will be able to access the data immediately, without having to wait for anything else. The data will not be curated, but your data scientists will not care at that point; they just want to try it out and use it for experimentation. If you are using Databricks, the Unity Catalog will track any notebook and analysis done using the data, so any result will still be reproducible. The important thing here is that the data scientists will use the data from the Catalog, instead of downloading it to a temporary location, which keeps your data team in the loop.

After a few months, more papers are published using the same technique. So far, you’ve been downloading new data to Bronze, without curation. However, as more datasets accumulate, things become too messy. This is the right time to create some aggregate tables in Silver. For example, you could concatenate the gene counts (or any other fact data) from the raw Bronze datasets into a big table. You’ll likely want to add some metadata tables, for example a projects table containing the list of datasets, and others as relevant. You don’t need to spend too many resources on this; just make something sensible that makes your users happy.
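As a rough illustration of this Bronze-to-Silver step, here is a minimal pandas sketch. The dataset names, column names and values are all hypothetical, and in a real setup the Bronze tables would be read from the catalog rather than built in memory.

```python
import pandas as pd

# Hypothetical raw Bronze datasets: one gene-count table per downloaded paper.
bronze = {
    "paper_2024_a": pd.DataFrame(
        {"sample": ["s1", "s2"], "gene": ["GeneA", "GeneA"], "count": [10, 25]}
    ),
    "paper_2025_b": pd.DataFrame(
        {"sample": ["s3"], "gene": ["GeneA"], "count": [7]}
    ),
}

# Silver fact table: concatenate all the gene counts, tagging each row
# with the project (dataset) it came from.
gene_counts = pd.concat(
    [df.assign(project=name) for name, df in bronze.items()],
    ignore_index=True,
)

# Silver metadata table: one row per project, with the number of samples.
projects = pd.DataFrame(
    {
        "project": list(bronze),
        "n_samples": [df["sample"].nunique() for df in bronze.values()],
    }
)
```

Nothing here is curated or harmonized; the point is only that users now query two tables instead of browsing many raw folders.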

After more time, a big initiative in your company focuses on using this new type of data for a specific goal. For example, the company wants to create a perfectly clean dataset for a drug discovery programme, integrating data from several modalities, including this one. A brave new Data Product Owner is assigned to the task, and they begin collecting user stories, designing data flows, outsourcing metadata curation services, and everything that is needed for a perfect product. This product will live in Gold; it will take some time to develop it, but once it is ready, it will be perfect for its purpose.

This is what I like about the Medallion Architecture: it is a framework that helps data teams manage expectations. Users will know that the Bronze data is not curated, but they will be fine with that, because that is the definition of Bronze. On the other hand, users will know that, if they want perfectly clean and harmonized data, they need to ask leaders to put resources into it, so that proper data products can be developed in Gold. As these assumptions are intrinsic to the way the data is stored, data team leaders will face less pressure from the requests they receive.

MLflow for Bioinformatics – the missing piece for reproducibility of results?

Giovanni Marco Dall'Olio — Thu, 23 Jan 2025 12:11:13 +0000

(Story originally published on Medium)

MLflow is a tool used to manage Machine Learning models, from their development to their release in production. It is also a great tool for managing bioinformatics analyses, making it easier to test parameters and options and reproduce results. However, in my opinion, not many bioinformaticians are aware of this tool and its advantages. In this article, I will show an example of using MLflow to track a differential expression analysis and demonstrate how it is the missing piece to ensure reproducibility.

Example of using MLflow for tracking a Differential Expression analysis. The last two columns show two parameters I tested: lfcShrink (whether to apply LFC Shrinkage or not), and refit_cooks (whether to filter out potential outliers, based on Cook’s distance). The “Metrics” columns track some metrics I computed for every analysis, such as the number of significant genes, the average LogFC, and so on. The first columns provide info on the input dataset, the notebook, and when it was executed.

What is MLflow?

MLflow is a platform for managing the life cycle of a Machine Learning model. It keeps track of all the experiments done while training a model, as data scientists try different parameters and algorithms to improve their predictions.

Suppose you are a Machine Learning scientist, tasked with developing a model to predict a specific variable. There are so many tools you can try (random forest, XGBoost, etc.), and for each of these there are so many parameters to test. How do you keep track of all these options and choices?

The answer is MLflow. This tool lets us record information on parameters, inputs, and metrics every time we run an experiment.

The following screenshot is taken from the MLflow documentation page. We can see a list of TensorFlow models, computed using different values of the parameters lr and momentum (in the last two columns of the table). For each run, the metric test_rmse is computed, which makes it quick to determine which run has the best results.

A screenshot of MLflow from its documentation page. Each row represents an experiment using TensorFlow. The last two columns show the values of the parameters (lr and momentum) used for every run. The third-to-last column shows the test RMSE for each run, making it possible to determine which run has the best results.

MLflow is much more than a registry of parameters and metrics. It lets you manage the whole life cycle of a model, from developing it, to tagging specific versions, to pushing it to production. These are all valid applications in Bioinformatics as well, but for this article I’ll just explain the basic usage of the Registry.

How can MLflow be used for Bioinformatics?

Despite being developed to track Machine Learning experiments, MLflow can be used to track any computational analysis. We don’t need to use scikit-learn or train Machine Learning models to use it. We can track anything that has parameters and produces results. Here, I’m going to show how to use it for a Differential Expression analysis made with the PyDESeq2 Python package.

Initializing an MLflow run

The first step is to import the library and initialize an experiment. The code below may seem verbose, but it can be copied and pasted into all notebooks. In short, we define the Experiment name and connect to it. If the experiment does not exist, we create it.

In the second cell, we start the MLflow run. This will create a new row in the MLflow registry for this experiment. The row will be empty, but we will add parameters and metrics later in the code.

Tracking Parameters

For this example, I am following the PyDESeq2 tutorial, almost verbatim.

I’ve added two parameters, to show how to keep track of them using mlflow. The first parameter is called “refit_cooks”, and determines whether we want to remove outliers based on Cook’s distance. This is a common step in a Differential Expression analysis, to remove samples that may have low quality for technical reasons. The second parameter is lfcShrink, which determines whether to apply LFC Shrinkage to the data. This is a technique used to clean up genes that have a low fold change in the results.

The mlflow code to track parameters is quite simple. We run mlflow.log_params, and provide a dictionary of parameters we want to track. Assuming we have executed mlflow.start_run() earlier in the code, these parameters will be added to the current run.

Tracking Parameters in MLflow

Tracking Input Data

Another useful thing to track when running a Differential Expression analysis is the input data.

If the dataset is small, we can track it using the mlflow.log_input() function as shown below. Otherwise, if the data is bigger, we can store it in a table and provide the location. If the data is complex to represent as a table, we can also store it as an Artifact, a file that is stored in the run and can be accessed later. It all depends on how big the input data is, and how much space you want to reserve for the run data.

Tracking Input datasets in MLflow

Tracking Results and Metrics

I’m not going to paste the code for this Differential Expression analysis, because it is taken verbatim from the PyDESeq2 tutorial. Let’s assume we have completed our analysis, and obtained a dataframe of genes that are differentially expressed in our comparison.

The results of our Differential Expression analysis, computed using the PyDESeq2 tutorial. The padj column tells us which genes are differentially expressed in cases vs control, and the log2FoldChange shows the magnitude of the change, and its sign.

Now that we have computed this table, we may want to keep track of some metrics, to determine whether the results are valid or not.

For a Machine Learning experiment, we could track accuracy, recall, MAE, RMSE, and other metrics.

In the case of a Differential Expression analysis, we don’t really have the equivalent of these metrics, but we can create others as we may see fit. For example, here I am computing the ones in the screenshot below:

Some of these metrics are self-explanatory: for example, n_significant_genes shows the number of genes that are significant in the results dataset.

Other metrics are based on the Biology. Here I’ve computed a metric called “is_gene3_significant”, to determine whether Gene3 is differentially expressed between cases and controls. This may be a hypothesis that I make about the data: for example, this is an experiment where Gene3 has been knocked out using gene editing or by other means. As another example, we may know from the literature that Gene3 is always expressed in the group of patients we are testing, and we may want to check whether that is the case.

There are endless possibilities when tracking metrics from an experiment. MLflow provides an efficient way to log these and navigate them across different runs.
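To make this concrete, here is a small pandas sketch of how such metrics could be computed from a results table with padj and log2FoldChange columns. The toy values, the 0.05 threshold, the metric name avg_abs_log2fc and the choice of averaging absolute fold changes are assumptions, not necessarily what the original notebook did.

```python
import pandas as pd

# Toy results table with the two columns described above (values are made up).
results = pd.DataFrame(
    {"log2FoldChange": [2.0, -1.5, 0.1, 0.4], "padj": [0.001, 0.03, 0.80, 0.20]},
    index=["Gene1", "Gene2", "Gene3", "Gene4"],
)


def compute_metrics(results, alpha=0.05, gene_of_interest="Gene3"):
    """Summarise a differential expression table as a dict of numeric metrics."""
    significant = results[results["padj"] < alpha]
    return {
        # Number of genes passing the adjusted p-value threshold.
        "n_significant_genes": float(len(significant)),
        # Assumption: average of the absolute log2 fold changes.
        "avg_abs_log2fc": float(results["log2FoldChange"].abs().mean()),
        # Biology-driven check, encoded as 1.0 (significant) or 0.0 (not).
        "is_gene3_significant": float(gene_of_interest in significant.index),
    }


metrics = compute_metrics(results)
```

All values are plain floats because MLflow metrics are numeric; inside an active run, the dictionary could be passed to `mlflow.log_metrics(metrics)`.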

The results

This screenshot shows what was recorded in MLflow when running this analysis.

The last two columns show the parameters used for every analysis. These are my original parameters lfcShrink and refit_cooks, which I defined earlier in my code.

The other columns on the right show the metrics from every run. For example, you can see that when lfcShrink is set to True, the average Log2FC is lower. This is expected, as LFC Shrinkage is designed exactly for that purpose.

The columns on the left show information on the runs, like the input data, the time when the experiment was run, and so on.

What’s next?

This article described how MLflow can be used to keep track of Bioinformatics tasks, such as Differential Expression.

In the past, I’ve tried many approaches to ensure the reproducibility of analyses, including Jupyter notebooks stored in git and all sorts of other approaches. However, I think MLflow is the ultimate answer, especially if integrated with environments like Databricks, as in these examples.

There are countless ways to design parameters and metrics for tracking the results of a differential expression analysis, and of other analyses in Bioinformatics. Hopefully this will give you a good primer, and help you make your tasks more reproducible and efficient.

Foundation Models for Bioinformatics – a Primer

Giovanni Marco Dall'Olio — Fri, 17 Jan 2025 11:54:06 +0000

(Original article published on Medium)

Foundation models for biology are one of the most significant technological advances in bioinformatics in recent years. However, they are built on concepts that can be relatively unfamiliar to people in the field, as they are derived from other areas of AI. This article summarises what you need to know about Foundation models in Biology and how they can be helpful to you.

Most people are now familiar with natural language models such as ChatGPT, trained on huge quantities of text and capable of generating responses that mimic human language. Many people are also acquainted with generating images and videos using other architectures. However, biology also has its own language, which can be modelled using machine learning. In fact, biology has many languages, from DNA to proteins to transcription factors and regulation, interactions between cell types and tissues, and much else.

Foundation models for biology attempt to model these languages, using large quantities of data for training and applying this knowledge to generate new data or make predictions.

A list of Foundation Models for Biology

This list is quite incomplete, as new models get published every week. For a more complete list, check this repository: https://github.com/HICAI-ZJU/Scientific-LLM-Survey (I hope they accept my pull requests for adding some recent ones). I also like following https://kiinai.substack.com/ for recent news.

[Foundation models for biology](https://www.notion.so/17ec9afb0bec80489d86ed3a453a05c9?pvs=21)

What are Foundation Models for Biology?

Generally speaking, they are models trained on large quantities of biological data, such as genomic sequences or chemical structures, using architectures like the Transformer, which is widely used for training Large Language Models (LLMs) like ChatGPT. The term has become popular thanks to the paper https://arxiv.org/abs/2402.04286 published in 2024.
Instead of training a Transformer on a dataset of texts or human conversations, these models are trained on biological sequences or other types of biological data. The model learns the “language of life” by capturing hierarchical patterns in the data.
- For example, a model may be trained on gene expression data from single-cell sequencing of blood tissue. The model learns which genes are typically expressed together in specific cell types, discovering these interactions purely from analyzing the training data, without any prior knowledge of biological pathways or gene sets.

What Are “Pre-training” and “Fine-tuning”?

Training a Foundation model is very expensive and requires large quantities of data. However, once a model has been pre-trained, it can be used and fine-tuned for other tasks. This means that we can take advantage of big atlases (provided the model is made public), and use them on smaller datasets, without having to download all the original data and train on it from scratch.

Pre-training: This is the initial phase where the foundation model is trained on a large dataset of biological data. For example, the model might learn to predict the next nucleotide in a sequence or fill in missing elements within a sequence. During this phase, the model learns the patterns underlying the input data – essentially building an understanding of the “language of biology.”
Fine-tuning: Once pre-trained, the model can be adapted to specific tasks by modifying its parameters and training it further on smaller, task-specific datasets. The advantage of this, compared to training a model directly on the small dataset, is that the model has already learned the relationships between the elements during pre-training.
- A pre-trained model might have learned general genomic patterns. By fine-tuning, you could train it to classify patients as healthy or diseased based on their genomic data.
- Fine-tuning doesn’t necessarily involve modifying only the last layer. Techniques like updating multiple layers or using adapter modules can help leverage the full depth of the model.
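To illustrate the “fill in missing elements” pre-training task mentioned above, here is a small sketch that builds one training example by masking part of a nucleotide sequence. The 15% masking fraction, the `[MASK]` token and the function name are assumptions chosen for illustration; real models use their own tokenizers and masking schemes.

```python
import random


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a fraction of the bases and return (masked tokens, hidden targets)."""
    rng = random.Random(seed)
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))

    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # what the model has to predict
        tokens[pos] = "[MASK]"      # what the model actually sees
    return tokens, targets


sequence = "ACGTACGTACGTACGTACGT"
tokens, targets = mask_sequence(sequence)
```

During pre-training, the model receives `tokens` and is scored on how well it recovers the bases in `targets`; no labels beyond the sequence itself are needed, which is why very large unlabeled datasets can be used.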

Example 1: using a pre-trained model to predict disease

Let’s imagine you are developing a model to predict whether a patient is affected by a disease. You start from a gene expression dataset in cases/controls from previous experiments.

One traditional bioinformatics approach may be to create a data frame of gene expression across all samples and train a model to distinguish cases and controls. One may use a random forest approach, XGBoost, or logistic regression. The predicted variable will be “disease”, and the gene expression values will be the input.

The problem with this approach is that it doesn’t consider the relationships between genes. Gene A may be in the same pathway as Gene B, and the pair may be frequently expressed together. Gene C may be a transcription factor of Gene D, but only when gene E is expressed, and so on. Biology is so complex, and we are far from understanding it.

If your training dataset is big enough, your model may be able to learn about all the gene relationships by itself. Or you may tweak it with some feature engineering to simplify the training. However, in most cases, your training dataset will be far too small. You may have a cohort of 10 or 20 samples at most. Even hundreds of samples may not be enough.

Here is where a foundation model comes into play. We can download a model previously trained on a large dataset of gene expression, like the Gene Expression Atlas, and apply it to our small dataset. We will transform the gene expression values into embeddings, which are numbers that include the knowledge of gene interactions learned during the pre-training. By using these embeddings instead of the original values, our predictions should be more accurate, and they will take into account all the biology that the foundation model has learned.
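As a very rough sketch of the idea, the snippet below turns expression values into sample-level embeddings by averaging gene embeddings weighted by expression. This is only one simple pooling strategy, not what any specific foundation model does, and the gene embeddings here are random numbers standing in for the ones a real pre-trained model would provide. All sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes, embedding_dim = 20, 1000, 16

# Toy expression matrix (samples x genes) standing in for a small cohort.
expression = rng.poisson(5, size=(n_samples, n_genes)).astype(float)

# Assumption: random stand-in for per-gene embeddings learned during pre-training.
gene_embeddings = rng.normal(size=(n_genes, embedding_dim))

# Normalise each sample so that its gene weights sum to 1.
weights = expression / expression.sum(axis=1, keepdims=True)

# Expression-weighted average of the gene embeddings: one vector per sample.
sample_embeddings = weights @ gene_embeddings
```

The classifier would then be trained on the 16 columns of `sample_embeddings` instead of the 1000 raw gene columns, which is far more manageable with a cohort of 20 samples.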

Example 2: Imputation and Upscaling of Microarrays

I like this example from the CpGPT paper https://www.biorxiv.org/content/10.1101/2024.10.24.619766v1.full, which presents a model trained on Methylation data.

Despite the advent of NGS technologies, Microarrays are still one of the most cost-effective ways to profile Methylation samples. There is a vast repository of data generated using old Illumina Microarray chips, like the HumanMethylation BeadChips, which can profile only 27K or 450K CpGs. There must be thousands of publications in PubMed, and datasets in GEO, generated using these old technologies. The newest arrays can profile more than 900K sites. It’s such a shame that this data is based on old technology and there are so many CpG sites missing.

Is there a way to use the data from all these old chips, and “update it” to the newest one, predicting the sites that were not present in the original assay? The answer, proposed in the CpGPT paper, is yes. They trained a transformer model on a large quantity of methylation data, including old and new chips. The model learned how to impute new sites, and predict the methylation status of the missing CpGs based on the available data. To validate the results, they used the imputed datasets to predict age, and obtained much lower errors.

Similarly, a foundation model can be used to impute missing SNPs from Genotype data, and other data types. Foundation models can also be used for generating new datasets synthetically – the Precious3GPT paper (https://www.biorxiv.org/content/10.1101/2024.07.25.605062v1) describes this neatly.

What Are the Key Concepts of Transformers?

Transformers are at the core of foundation models. Here are the features that differentiate them from other architectures:

Embeddings:
– Input data (e.g., nucleotides, amino acids) is converted into numerical representations called embeddings.
– Embeddings encode both the identity and context of each element in the data. For example, in a nucleotide sequence, embeddings might capture the base type, its genomic position, and nearby elements.
– The art of creating embeddings lies in designing features relevant to the task. For example:
  – Evo uses embeddings optimized for evolutionary data.
  – scGPT encodes single-cell RNA-seq data, including metadata like cell type and experimental batch.
  – CpGPT incorporates features like CpG methylation status, genomic position, and neighboring base context.
Attention Mechanism:
– Transformers use self-attention to understand relationships between different parts of the input. For example, the attention mechanism allows the model to connect distant elements in a sequence, such as regulatory elements and target genes. This ability to capture long-range dependencies is particularly crucial for biological data.
Generative Training:
– Many Transformers are trained using tasks like masked language modeling (e.g., predicting masked bases) or next-token prediction. These tasks help the model generalize across datasets and tasks by building a rich internal representation of the data.
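The attention mechanism described above can be sketched in a few lines of NumPy. This is the standard scaled dot-product self-attention with a single head; the sequence length, the embedding size and the random weights are arbitrary choices for illustration, and real models add multiple heads, positional information and many stacked layers.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence of embeddings."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Similarity between every pair of positions, scaled by sqrt(dimension).
    scores = q @ k.T / np.sqrt(k.shape[-1])

    # Softmax over each row, so the weights of every position sum to 1.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each output is a weighted mix of the values at all positions.
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model = 6, 4  # e.g. 6 tokens, each embedded in 4 dimensions

x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
```

Row i of `weights` tells how much position i attends to every other position, regardless of how far apart they are in the sequence. This is what lets a model link a distant regulatory element to its target gene.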

Applications of ChatGPT and other LLMs for drug discovery

Giovanni Marco Dall'Olio — Fri, 13 Dec 2024 08:49:50 +0000

LLMs and foundation models are changing the way drug discovery is done – it’s still early days, but there is no denying it. However, when you talk about it with someone outside the field, they often have no idea what you are talking about. Some people have heard of ChatGPT, but only a few have tried it.

For this reason, it is important to do some science outreach and communicate with the broader public, because this technology will impact everyone. As we get closer to the holidays, this may be a good conversation topic for family and friends, between singing Christmas carols and eating food :-).

This is a short presentation I gave this week at the “Salute e’ Donna” conference in my home town, Lanciano (Italy). I’m very thankful to the organizers from Fondazione Abruzzese per le Scienze della Vita ONLUS for inviting me. The slides are in Italian, and a bit of dialect, and they oversimplify a lot of details, but it has been fun to present them and engage with the audience during the seminar.

Book Lovers’ day 2022

Giovanni Marco Dall'Olio — Wed, 10 Aug 2022 13:02:50 +0000

Yesterday was Book Lovers’ day! Thanks to Brady Todd for reminding me.

Here is a list of books I have read recently and recommend.

– Maverick by Ricardo Semler: this is a great classic on Leadership and on how to organize a company. It tells the story of Semco, a Brazilian company where all employees self-organize, everybody is free to set their salary and working hours, and there is mostly no hierarchy. I am planning to write a longer post about this book, because I believe many of these practices were used at GSK when I joined, and the company was organized into small units resembling self-organizing startups. It’s amazing that this book was written more than 30 years ago!

– The Phoenix Project by Gene Kim: A classic Agile book. It tells the story of a company that slowly evolves from a legacy Waterfall strategy to Agile. It is written from the perspective of a developer, and towards the end, I was almost crying as I was reading the pages. It was so painful to read about all the crashes and big security holes when they released the software for the first time, and so fun to see how they improved the process! Will they be able to do ten deployments per day, or crash in the process? This is definitely a nerdish read!

– Game Wizards by Jon Peterson: The story behind the creation of Dungeons & Dragons, the first role-playing game, almost 50 years ago. The book tells the horror story of the company founded by Gary Gygax, how it was mismanaged, and what a terrible working environment it was. I’m glad D&D has survived to this day, but it is sad to read how it was developed – at least, there are many things to learn from this story.

– Bad Blood by John Carreyrou: Another horror story about a start-up. This is the story of Theranos, the company founded by Elizabeth Holmes. Apart from being an amazing piece of journalism, this book is an example of everything you should not do when managing a company, and a collection of all the worst leadership practices a company could have. I got very addicted to this book, and started watching videos on YouTube to learn more about this story.

– The Power of Ethics by Susan Liautaud: A collection of stories and points of reflection about ethics in current times, ranging from the scandal of human gene editing by a Chinese scientist to the Boeing 737 Max 8 scandal, and much more. The book proposes a framework to understand the impact of ethical choices and the way information and consequences spread in the community. Susan joined BenevolentAI, the company I worked for, as a member of the board, which is great news for us.

– Between Ape and Human by Gregory Forth: This is a book by an anthropologist who collected evidence that Homo floresiensis may actually still be alive, according to tales and legends from people living on Flores Island. The evidence unfortunately is not very strong, but there may be some truth in it. This book made me want to leave everything and depart for Flores Island :-)

– Humankind by Rutger Bregman: I am halfway through this book, but I have already been enjoying it. Essentially it promotes the idea that humans are intrinsically good. Current society and culture make us think that we are more selfish and aggressive, but in reality, if you look at things case by case, this is not true. Recommended read!

A Bioinformatician in Big Pharma

Giovanni Marco Dall'Olio — Sun, 23 Jul 2017 19:39:09 +0000

The last 18 months have brought quite a radical career change for me. This is because I made the infamous move: leaving academia and starting to work in industry.

My career from Bologna to Barcelona, and from London to the British Countryside.

To be honest, I am quite happy with the change. I’ve learned many things, discovered another way to do science, and possibly made some contributions. Moving from academia to industry sometimes has a bad reputation, but these months taught me that developing a drug involves many resources: not only a smart idea in the lab, but also a lot of validation, regulation, planning, marketing, budgeting, understanding the impact on patients, and much more.

Where am I working exactly?

I am in the pre-clinical department of a big pharma company, GSK. More specifically, my department is called Target Sciences, and its main scope is to identify and validate new targets (in layman’s terms: genes or biological entities) to treat indications (in layman’s terms: diseases or phenotypes).

The R&D department of GSK is structured in several Discovery Performance Units (DPUs), which are small independent units working on a specific therapy area. For example, there could be a DPU focused on Oncology, or another on Asthma and respiratory diseases. These DPUs are like small start-ups within the company, and each of them carries a few drug targets through the drug discovery process.

Drug discovery process – I am in the first phase. Source: https://www.slidegeeks.com/shapes/product/business-steps-powerpoint-templates-marketing-drug-discovery-process-ppt-slides

My department helps all these DPUs identify and evaluate drug targets, providing computational biology expertise, together with genetics, stats, and experimental validation. It’s like a center of excellence which interacts with all the rest of R&D.

Identifying the correct target is important because it is the first decision in the drug development process, and an error in this step can be quite expensive. Imagine what happens when a clinical trial fails in phase III because the original drug was targeting the wrong gene: it is quite a big waste of resources, not only for the company but also and more importantly for the patients.

What is target identification, and what is my role?

In layman’s terms, identifying a drug target involves answering the following question: if I want to treat disease X, which would be the best genes to target?

From a computational point of view, there are several ways to answer such a question. You may simply go to the literature (e.g. PubMed) and search for relevant articles. Other approaches involve looking at information from several sources, like gene expression, protein interactions, involvement in pathways, and much more. It is usually a matter of data integration, or data science.

If you want to get a more general idea of the types of sources used for target identification, you can have a look at the Open Targets Platform; this is a pre-competitive effort to curate and integrate data sources, supported by the EBI, GSK and other pharmas.

My role, in particular, is more focused on data integration and management than pure analysis. It is about making the best use of the datasets we have access to, and understanding what the value of acquiring a new dataset is. It is also about improving communication about data usage, and discovering new technologies and methods to make use of the data.

What is good about working in a pharma, compared to academia?

Let’s say three things:

Team Working. This is the answer that hurts the most, especially for me.
If you look at the previous posts in this blog, you can see how much I care about doing science in an agile way, planning properly and sharing information. The problem is that in academia, the pressure of having to publish first-author papers ruins it all.
In the academic world there is a lot of collaboration, especially online, along with team meetings and journal clubs; but at the end of the day, your long-term prospects all depend on your own reputation in the scientific world. This is fair enough, but difficult to reconcile with real team working.
Lots to learn. Everybody is usually involved in more diverse projects, and interacts with more people from different backgrounds. Thus, you tend to specialize less in a specific area, and learn a bit of everything. To be honest, I prefer this approach as it keeps my attention higher. I am glad that I did a PhD, during which I spent several years specializing in a single area, human genetics; however, now that I am older I like learning more about different fields.
Possibility to grow. You are generally more pampered and cared for than in academia. You are actively encouraged to follow courses and learn new technologies, and my line manager complains if I am still in the office after 6 pm (to be honest, my PhD supervisor did too). There are opportunities to do secondments in other parts of the company, and learn about clinical trials, finance, or anything related. Every year you define a list of objectives with your line manager, and in a fair process you are valued for your efforts and accomplishments in reaching them.

What is Bad?

Politics. Unfortunately politics is everywhere, especially in a big international company. Luckily I am still unimportant enough that this doesn’t affect me much.
Simplification. Interacting with people from different backgrounds means that you need to simplify, and learn to explain complex biological concepts in a way that is easy to understand. This is not easy and sometimes leads to funny effects, e.g. when you start hearing buzzwords and oversimplifications. On the bright side, at least I am improving my communication skills.

What’s next?

For personal reasons I haven’t written much in this blog lately, and I may not be able to write much in the near future. However, hopefully I’ll be able to write more about this new adventure, and describe how science is done from the industry side.

Hacking Global Health London 2016

Giovanni Marco Dall'Olio — Fri, 09 Dec 2016 08:24:47 +0000

A few months ago I participated in a Hackathon organized by Open Data Science on data from the Healthy Birth, Growth and Development knowledge integration (HBGDki) initiative by the Bill & Melinda Gates Foundation.

The aim of this initiative is to collect data on child growth and development from several sources, to study which factors influence child growth and how to better intervene when there are risks. Currently the data comes from manual annotation of several publications, but future plans include launching a global effort to collect data systematically, and one of the objectives of the hackathon was actually to guide the planning of this effort.

I had a lot of fun during the hackathon and learned a lot. For me personally it was an opportunity to learn more about the caret R package, which is a must-know library for doing machine learning in R. My plan for the hackathon was actually to do a trajectory clustering, to see if there were different trajectories of growth of the baby during pregnancy, but unfortunately the analysis didn’t return very interesting results :-)

See my GitHub repo for some Jupyter notebooks, and the slides on SlideShare for more info.

Published a “Post Publication Review” on Publons

Giovanni Marco Dall'Olio — Fri, 28 Oct 2016 16:40:28 +0000

A while ago I posted in this blog an analysis on fitness genes, illustrating a use of the Bioconductor data packages and based on a recently published paper (Are fitness genes more conserved across species?).

This week I was contacted by the Publons team and asked to post the same analysis on their platform as a “Post Publication Review”. Of course I accepted: Post Publication Review of High-Resolution CRISPR Screens Reveal Fitness Genes and Genotype-Specific Cancer Liabilities

Publons is a social network for peer reviewers, where you can list the papers you reviewed, get credit for it, and even post new reviews of published papers. I personally like the idea of Publons very much, because I think that reviewing papers is an important part of science, which unfortunately doesn’t get the recognition it deserves.

Hiding cows in the genome (a.k.a. an introduction to bash programming)

Giovanni Marco Dall'Olio — Fri, 14 Oct 2016 16:54:29 +0000

Preparing the materials for a workshop on bash programming is very difficult, because you never know which level of skill to expect from the people attending it.

Click on the image to access the slideshow.

Most of the time the class will be a mix of absolute beginners and expert Unix users, and it is not easy to prepare a presentation that will interest both. If the materials are too advanced, the beginners will get frustrated and stop paying attention. If the materials are too simple, expert users will soon get bored and distracted, and start working on their own things and checking Facebook.

In an attempt to avoid these issues, I decided to go for a trick that hopefully would get the attention of even the most advanced bash guru, which is: hiding cows in the genome.

More precisely, for a workshop at the Programming for Evolutionary Biology conference held this year in Belgrade, I designed the exercises so that the instructions for the next step can only be retrieved by using the correct bash commands. Students start with a file of randomly generated text, and they have to use grep and other Unix tools to proceed to the next exercise. If the exercise is done correctly, they also see a cow.

I think it worked decently, because the students liked the idea and finding cows in the FASTA and BED files was fun.
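To give an idea of how such an exercise can be built, here is a minimal sketch in bash. It is not taken from the actual workshop materials: the file name exercise1.txt, the "NEXT:" marker and the text of the hidden instruction are all made up for illustration. The script buries one instruction line among filler lines, and then plays the part of the student who retrieves it with grep.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a temporary directory, cleaned up when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
exercise="$workdir/exercise1.txt"   # hypothetical file name

# Teacher's side: bury one instruction line among 100 filler lines.
# The "NEXT:" marker and the instruction text are invented for this sketch.
{
  for i in $(seq 1 50); do echo "ACGTTGCA filler line $i"; done
  echo "NEXT: count the sequences in exercise2.fasta"
  for i in $(seq 51 100); do echo "TTGACCGA filler line $i"; done
} > "$exercise"

# Student's side: the right grep command reveals the next instruction.
instruction=$(grep '^NEXT:' "$exercise")
echo "$instruction"

# Reward for a correct answer: a cow.
if [ -n "$instruction" ]; then
  cat <<'EOF'
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
EOF
fi
```

In a real class the filler would be random sequence data rather than repeated lines, and the marker would be something students have to work out from the previous exercise instead of a fixed "NEXT:" prefix.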

The workshop’s materials are available at the link below. If you are a teacher and organize workshops on bash programming, here I am officially challenging you to include something similar in your next presentation :-)

https://nbviewer.jupyter.org/format/slides/github/dalloliogm/belgrade_unix_intro/blob/master/PEB%20Bash%20Workshop.ipynb#

Data Annotation Packages in BioConductor

Giovanni Marco Dall'Olio — Thu, 13 Oct 2016 17:00:54 +0000

Bioconductor does not only contain analysis packages, but also a good suite of data packages, frozen from the most important data sources for bioinformatics (e.g. EBI, NCBI, UCSC, etc.).

These data packages are useful because they give quick access to biologically relevant data, without having to download it manually from the web. They are used internally by several analysis packages (e.g. to calculate ontology enrichment, get gene coordinates, etc.), and in a way they improve the reproducibility of your analysis: by updating them within R, you get access to the same frozen version of the data as anybody else using them.

This slideshow provides a quick summary of all the data annotation packages available, how to use them and how this part of Bioconductor is evolving.

Click on the screenshot or here to access the slideshow.

I prepared the slideshow for the second workshop I presented this year at the Programming for Evolutionary Biology conference in Belgrade. It is probably less glamorous than the Bash slideshow, as there are no hidden cows; however, it may be more useful, especially if you use Bioconductor regularly.

Disclaimer: I am not a Bioconductor developer, just a user. So apologies if I wrote anything wrong :-)