Portfolio

Original work

These projects comprise either extended analysis of published data, or completely independent work.

2018

Technical test for the Junior Data Scientist position at Datrik Intelligence

This report was my deliverable for the technical test that the data analytics company Datrik Intelligence requires of applicants to its Junior Data Scientist position. Its CEO is a well-known data scientist who has ranked #1 worldwide on Kaggle. The test consisted of an Exploratory Data Analysis (EDA) coupled with the development of two predictive models.

What I learned:

Multiple Correspondence Analysis (MCA) in FactoMineR
Tree-based modelling in scikit-learn

Master's Thesis titled “Development of label-free quantification methods in proteomics”.

(I) Compilation of a free and open-source (in both cost and license) pipeline for the analysis and quantification of DDA proteomics datasets in a GNU/Linux environment. It makes extensive use of the tools developed by the CompOmics group at Ghent University (Belgium).

(II) Writing of a statistical model for the probabilistic quantification of protein ratios between samples using the PyMC3 framework, focusing on the assessment of uncertainty. Carried out at Advanced Analytics, Novozymes.

What I learned:

Probabilistic programming in PyMC3
DDA Proteomics analysis
Pipelining in bash

Bioinformatics project titled “Classification and prognosis of Rheumatoid Arthritis patients. A case study”. Data cleaning and feature engineering (scikit-learn), and development of a vanilla neural network (TensorFlow) to classify patients into 2 groups (progressor or non-progressor). An exploration of the architecture space identified the dimensions of the best NN model, which achieved an AUC of 0.78 on the test set. Carried out at Nordic Bioscience.

What I learned:

Data preprocessing in Python
Vanilla NNs with TensorFlow
Classifier evaluation with scikit-learn

2017

Bioinformatics project titled “An interactive visualization tool for clinical hypothesis generation”. Development of a Shiny app for straightforward visualization of health data. Powered by ggplot2, the app performs an exploratory analysis of the PERF database using boxplots, scatterplots, and a correlogram. The data can be filtered by different categories to interactively modify the graphics and download subsets of the database. Carried out at Nordic Bioscience.
Bioinformatics project titled “Finding patterns in fitness, nutrition and lifestyle-associated genomics data”. Analysis of SNP array datasets from a client database together with the 1000 Genomes panel, drawing conclusions on the predictive power of genomic variants. I worked with PLINK, Bedtools, ADMIXTURE, SQL and ggplot2.
Assessing admixture proportions. Building on the output of the NGSadmix program, a likelihood model is constructed to assess the significance of the estimated admixture contributions.
EM Algorithm and NGS data. An implementation of the Expectation-Maximization algorithm applied to the estimation of base composition in haploid genomes.
Population Genetics final exam. Working with SNP data from zebra species in the PLINK suite. The main topics focused on the genetic support for the distinction between plains subspecies and the position of the extinct E. quagga quagga in the phylogenetic tree of zebras.
Statistics for e-Science coursework: an ioslides presentation and Shiny apps developed as part of the weekly exercises in this subject at KU.
Structural Bioinformatics final exam. Students were asked to implement a Python pipeline to compute RMSD distributions on 5-mer peptides and interpret the results. Code available on GitHub.
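
As a rough illustration of the RMSD computation mentioned in the Structural Bioinformatics exam above, here is a minimal sketch of the RMSD between two sets of atoms after optimal superposition with the Kabsch algorithm. It is not the code from the exam: the function name kabsch_rmsd, the use of NumPy, and the assumption that each 5-mer is represented by an (N, 3) array of matched atom coordinates (for example, its five C-alpha atoms) are all hypothetical choices made for this example.

```python
import numpy as np


def kabsch_rmsd(a, b):
    """RMSD between two matched (N, 3) coordinate sets after optimal superposition.

    Hypothetical helper for illustration; assumes row i of `a` corresponds
    to row i of `b` (e.g. the same atom in two peptide fragments).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 3:
        raise ValueError("expected two arrays of identical shape (N, 3)")

    # Remove translation by centering both sets on their centroids.
    a0 = a - a.mean(axis=0)
    b0 = b - b.mean(axis=0)

    # Covariance matrix and its singular value decomposition.
    h = a0.T @ b0
    u, _, vt = np.linalg.svd(h)

    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1) rather than a reflection.
    d = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0
    rot = u @ np.diag([1.0, 1.0, d]) @ vt

    # Rotate a onto b and measure the remaining deviation.
    diff = a0 @ rot - b0
    return float(np.sqrt((diff ** 2).sum() / len(a)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fragment = rng.normal(size=(5, 3))          # made-up coordinates
    noisy = fragment + rng.normal(scale=0.1, size=(5, 3))
    print(kabsch_rmsd(fragment, noisy))
```

Applying such a function to every pair of fragments in a set would yield the kind of RMSD distribution the exam asked about; how fragments were actually selected and paired there is not described here.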

2016

Bioinformatics and Genomic Analysis final exam consisting of 2 parts: (I) extended ChIP-seq analysis for prediction of PIF1 and PIF4 interaction in A. thaliana; (II) de novo RNA-seq analysis of early and late embryonic stages in X. tropicalis using edgeR.
Bachelor's Thesis titled “Caracterización de la proteína All1873” (“Characterization of the All1873 protein”, written in Spanish).

Replicates of previously published scientific works

The projects below are replicates of published computational analyses carried out by other scientists. Thus, although there may be minor differences, the bulk of the work is not my own original creation. The scientific article each project is based on can be found in the references section.

Microarray data analysis

Data for 5 conditions in mouse lung tissue

What I learned:

Bioconductor and R packages: limma, affy
Quality control of samples
Graphics for results interpretation: scatter and volcano plots, Venn diagrams, heatmaps
Microarray tech pros and cons
GEO database