Portfolio
Original work
These projects comprise either extended analysis of published data, or completely independent work.
2018
This report was my deliverable for the technical test that the data analytics company Datrik Intelligence requires of applicants to its Junior Data Scientist position. Its CEO is a well-known data scientist who has ranked #1 worldwide on Kaggle. The test consisted of an Exploratory Data Analysis (EDA) coupled with the development of two predictive models.
What I learned:
- Multiple Correspondence Analysis (MCA) in FactoMineR
- Tree-based modelling in scikit-learn
(I) Compilation of a free and open-source (in both cost and license) pipeline for the analysis and quantification of DDA proteomics datasets in a GNU/Linux environment. It makes extensive use of the tools developed by the CompOmics group at Ghent University (Belgium).
(II) Writing of a statistical model for the probabilistic quantification of protein ratios between samples using the PyMC3 framework, focusing on the assessment of uncertainty. Carried out at Advanced analytics in Novozymes.
What I learned:
- Probabilistic programming in PyMC3
- DDA Proteomics analysis
- Pipelining in bash
- Bioinformatics project titled “Classification and prognosis of Rheumatoid Arthritis patients. A case study”. Data cleaning and engineering (scikit-learn), and development of a vanilla neural network (TensorFlow) for classification of patients into 2 remission groups (progressor or not). An exploration of the architecture space identified the dimensions of the best NN model, which achieved an AUC of 0.78 on the test set. Carried out at Nordic Bioscience.
What I learned:
- Data preprocessing in Python
- Vanilla NNs with TensorFlow
- Classifier evaluation with scikit-learn
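To make the AUC figure above concrete, here is a minimal sketch of what the metric measures: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. This is not the project code. The labels and scores are made up for illustration, and in practice one would more likely call scikit-learn's `roc_auc_score` than write this by hand.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 = progressor, 0 = non-progressor.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(f"AUC = {auc:.2f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one, so 0.78 sits between the two.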
2017
-
Bioinformatics project titled “An interactive visualization tool for clinical hypothesis generation”. Development of a Shiny app for straightforward visualizations of health data. Powered by ggplot2, the app performs an exploratory analysis of the PERF database using boxplots, scatterplots, and a correlogram. The data can be filtered by different categories to interactively modify the graphics and download subsets of the database. Carried out at Nordic Bioscience.
-
Bioinformatics project titled “Finding patterns in fitness, nutrition and lifestyle-associated genomics data”. Analysis of SNP array datasets from a client database together with the 1000 Genomes panel, drawing conclusions on the predictive power of genomic variants. I worked with PLINK, Bedtools, ADMIXTURE, SQL and ggplot2.
-
Assessing admixture proportions. Building on the output of the NGSadmix program, a likelihood model is constructed to assess the significance of the estimated admixture contributions.
-
EM Algorithm and NGS data. An implementation of the Expectation Maximization algorithm applied to the estimation of base composition in haploid genomes.
-
Population Genetics final exam. Working with SNP data from zebra species in the PLINK suite. The main topics focused on the genetic support for the distinction between plains subspecies and the position of the extinct E. quagga quagga in the phylogenetic tree of zebras.
-
Statistics for e-Science work: an ioslides presentation and Shiny apps developed as part of the weekly exercises in this subject at KU.
-
Structural Bioinformatics final exam. Students were asked to implement a Python pipeline to compute RMSD distributions on 5-mer peptides and interpret the results. Code available on GitHub.
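As an illustration of the RMSD computation mentioned in the Structural Bioinformatics item, the sketch below computes pairwise RMSD values over a small set of 5-residue fragments. It is not the exam code. The coordinates are invented, one atom per residue is assumed, and the fragments are compared as given: a real pipeline would normally superimpose each pair first (for example with the Kabsch algorithm), which this sketch deliberately leaves out.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z).

    Assumption: the two fragments are already in the same frame of reference;
    no optimal superposition is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("fragments must be non-empty and of equal length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


def pairwise_rmsd(fragments):
    """RMSD for every unordered pair of fragments, in (i, j) order with i < j."""
    values = []
    for i in range(len(fragments)):
        for j in range(i + 1, len(fragments)):
            values.append(rmsd(fragments[i], fragments[j]))
    return values


# Hypothetical 5-mer fragments: one atom per residue, offset along z.
frag_a = [(float(i), 0.0, 0.0) for i in range(5)]
frag_b = [(float(i), 0.0, 1.0) for i in range(5)]
frag_c = [(float(i), 0.0, 2.0) for i in range(5)]

distribution = pairwise_rmsd([frag_a, frag_b, frag_c])
print(distribution)
```

With many fragments, the resulting list is what one would plot as a histogram to study the RMSD distribution.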
2016
-
Bioinformatics and Genomic Analysis final exam consisting of 2 parts: (1) extended ChIP-seq analysis for prediction of PIF1 and PIF4 interaction in A. thaliana; (2) RNA-seq de novo analysis of early and late embryonic stages in X. tropicalis using edgeR.
-
Bachelor Thesis, entitled “Caracterización de la proteína All1873” (in Spanish).
Replicates of previously published scientific works
The projects below are replicates of published computational analyses carried out by other scientists. Thus, although there may be minor differences, the bulk of the work is not my own original creation. The scientific articles these projects are based on can be found in the references section.
Microarray data analysis
Data for 5 conditions in mice lung tissue
What I learned:
- Bioconductor and R packages: limma, affy
- Quality control of samples
- Graphics for results interpretation: scatter and volcano plots, Venn diagrams, heatmaps
- Microarray tech pros and cons
- GEO database
RNA-seq data analysis
Data for 3 watering conditions in A. thaliana
What I learned:
- Tuxedo protocol: bowtie, cufflinks, cuffmerge, cuffdiff and cummeRbund
- Working on a Sun Grid Engine cluster through the command line
- Gene set enrichment analysis (GSEA)
- Gene ontology (GO)
- SRA database
Gene coexpression networks
Studying correlations in wine yeast gene expression
What I learned:
- Network theory: scale-free and small-world networks
- Biological network motifs: AR, DFL, IFFL and CFFL
- Cytoscape software and igraph package
- Correlation analysis
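The idea behind a gene coexpression network can be sketched in a few lines: compute the correlation between every pair of expression profiles and keep an edge wherever the absolute correlation passes a threshold. The example below is a minimal illustration, not the analysis from the replicated study. The gene names, expression values and the 0.9 threshold are all assumptions chosen for the demo.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.9):
    """Return (gene1, gene2, r) for each pair with |r| >= threshold."""
    genes = sorted(expression)
    edges = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(expression[g1], expression[g2])
            if abs(r) >= threshold:
                edges.append((g1, g2, r))
    return edges


# Hypothetical expression profiles across four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
edges = coexpression_edges(expression)
print(edges)
```

The resulting edge list is the kind of input that could then be loaded into Cytoscape or igraph to study network properties such as degree distribution.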
Cell physiology simulations
Simulation on Ca2+ spontaneous peaks
What I learned:
- Implement biological networks and motifs
- COPASI and Cell Designer software
- Effect of network motifs on a system's dynamics
RNA-seq and ChIP-seq integral analysis
Development of a workflow for integral data analysis. Gene Set Analysis was carried out on subsets of interesting genes found by RNA-seq and ChIP-seq.
What I learned:
- Shell scripting and blackboard messaging
- ChIP-seq analysis software: MACS and PeakAnalyzer
- Motif finders: HOMER
- KEGG Database
- ggplot2 and other "Hadley verse" R packages
Synteny study in Python
Analysis of M. pneumoniae and M. genitalium
What I learned:
- Python scripting and the Biopython module (Download script here)
- The time Linux utility
RNA-seq de novo analysis
De novo genome assembly on frog genome and differential expression analysis on 2 different developmental stages
What I learned:
- Genome assembly and assessment methods: Trinity
- Transdecoder
- KAAS
Variant call analysis
Discovering differences between an arbitrary A. thaliana genome and a reference genome
What I learned:
- bwa
- GATK software suite (Download shell script here)