Full-Lifecycle Data Practitioner
resume.juandedios.info · 2026

Juan de Dios Alvarez

Data Scientist  ·  Data Engineer  ·  Geospatial Analyst  ·  ML Practitioner

I build what the data team does not yet have — the pipeline that replaces the spreadsheet, the model that goes beyond the notebook, the dashboard that needs no user manual. Twenty years across particle physics, NLP, geospatial analysis, agricultural finance, healthcare, education, and media: different domains, the same discipline. Quality first. Accessible outputs. Value demonstrable before the scale-up.

20+ Years with data
7 Industries
CERN International research credential
01 · Three problems I solve.

1 · Data Infrastructure & Governance

“We have data scattered across spreadsheets, vendor feeds, and siloed systems with no single source of truth.”

At FIRA (Mexico's national agricultural development trust), I replaced the institution's ad-hoc Excel workflows with an automated, governed pipeline that integrates satellite vegetation indices, climate risk models, and regulatory databases into credit decisions for agricultural producers. Credit analysts adopted it without technical training. Not because it was simple, but because it was designed for them.

At D3Clarity, I built the analytics practice from scratch, adding MDM and data lake infrastructure for international clients. Semarchy xDM certified — MDM, Data Profiling, Data Lineage, Golden Records.

FIRA · D3Clarity · Semarchy xDM · AWS · PostgreSQL
2 · Unstructured Text & NLP

“We have thousands of documents — meeting minutes, contracts, reports — and can't extract anything useful from them.”

At D3Clarity, I built the full NLP pipeline: automated scraping of public school board meeting minutes → ML topic classification → Power BI dashboard mapping sales opportunities per district. Public records turned into a production commercial tool.

At BairesDev, a RAG and NLP pipeline made unstructured legal contracts queryable, with insight extraction per lawyer and per contract. My MSc thesis addressed automatic text classification in 2009, so my NLP work predates the LLM era.

D3Clarity · BairesDev · MSc ML/NLP · INAOE 2009
3 · ML to Production

“Our models work in notebooks. They don't work in the organization.”

Four years at D3Clarity: every project followed PoC → production, without exception. At CITEIM, I worked on Transformer-based models for continuous sign language recognition, served by a fully local, on-device inference pipeline. No cloud dependency, by design.

At the HAWC Observatory, distributed computing pipelines processed petabyte-scale cosmic ray datasets across US and Mexican institutions. The physics background is not incidental — it is where production-grade scientific software is built under real constraints.

D3Clarity · CITEIM · HAWC · Azure ML · SageMaker
02 · Work history.

2025 – 2026
FIRA
Data Scientist
Digital transformation of Mexico's national agricultural credit process. Automated pipeline replacing manual workflows: satellite, climate, and regulatory data integrated into repeatable, governed credit assessments. Adopted by non-technical analysts without retraining.
2024 – 2025
CITEIM
Data Scientist
Transformer-based models for continuous sign language recognition. Fully local inference pipeline with on-device processing and no cloud dependency, a deliberate architectural choice to reduce latency and eliminate operational cost. Current-generation AI applied to a direct human communication problem with social impact.
2023 – 2024
Punto Singular
Data Scientist & Team Lead
KPI and quality dashboards (R, Python, Tableau, Looker) for business stakeholders. Financial market streaming prototype on AWS (WebSocket pipelines). Multi-sector client engagements: EdTech, environmental monitoring, industrial analytics. Founded an internal Data Science group in which students solve real problems for nonprofits and clients.
2023
BairesDev
Data Scientist
Graph database modeling (Neo4j) for relationship analysis. Transformer-based text summarization. RAG + NLP pipeline for legal contracts: clause extraction, obligation mapping, per-lawyer portfolio querying. Airtable data curation.
2019 – 2023
D3Clarity
Data Scientist
Built the analytics practice from scratch, expanding the company's offering beyond MDM. Four years: NLP pipelines, data lakes, ETL, forecasting models, MDM, executive dashboards for international clients. Agile/Scrum with Jira. Every project: PoC to production.
2004 – 2019
HAWC · Tec MTY · UMSNH
Researcher · Lecturer · Consultant
PhD candidate at the HAWC Observatory: petabyte-scale cosmic ray data, distributed computing, Bayesian spectral analysis (14 co-authored publications). Teaching at Tec de Monterrey. IT infrastructure and sysadmin at UMSNH. Consulting across real estate, media, healthcare, education, and public sector.
2002
CERN / U. Lausanne
DAQ System Member
Data acquisition system for a Positron Emission Tomography scanner (ClearPET project). Signal conversion and data handling at CERN, under CONACYT-CERN cooperation. This is where my benchmark for data precision was set.
03 · Academic foundation.

Education
2009 – 2013
PhD Candidate — Physics
UMSNH · HAWC Collaboration (Mexico / United States)
Cosmic ray composition analysis at petabyte scale. GPA 10.0. 14 co-authored publications. Multinational experiment spanning US and Mexican research institutions.
2006 – 2009
MSc — Computer Science (ML / NLP)
INAOE — National Institute of Astrophysics, Optics and Electronics
Thesis: “Automatic text classification using prototype-based class reduction.” The foundation for every NLP engagement that followed, written in 2009, before the LLM era.
August 2009
CERN School of Computing
Georg-August-Universität Göttingen · CERN CSC2009 · 6 ECTS
Scientific computing in high-energy physics. Distributed systems, advanced computational methods.
1998 – 2003
BSc — Physics & Mathematics
Universidad Michoacana de San Nicolás de Hidalgo
Thesis: “Use of a pixel silicon detector in radiography.” Data acquisition and signal processing at the instrumentation level — where most data quality problems originate.
Certifications
AWS Cloud Practitioner
AWS Well-Architected Framework · Cloud infrastructure fundamentals · 2023
Semarchy xDM Certified
Master Data Management · Data Profiling · Data Lineage · Golden Records · Governance · 2022

Contributing member of The HAWC Collaboration, an international multi-institution observatory operating at 4,100 m altitude on Sierra Negra, Mexico. PhD research published in Phys. Rev. D (2022). The Collaboration's research has appeared in Science (2017), among other peer-reviewed journals and international conference proceedings.

Publication record on ORCID
04 · Technical capabilities.

Core Languages

Python · SQL · R · Julia · C / C++ · Java · Shell

ML & AI

NLP / LLMs · Transformers · scikit-learn · Geospatial ML · Forecasting · Classification · RAG · Gen AI

Data Engineering

ETL / Pipelines · MDM / Governance · PostgreSQL · Neo4j · Oracle / MySQL · Linux / SysAdmin · Git

Cloud & MLOps

AWS EC2 / SageMaker · Azure ML · GCP · Agile / Scrum · Jira

Visualization

Power BI · Tableau · Apache Superset · Looker · Airtable

Methods

Data Governance · Data Literacy · Bayesian Analysis · HPC / Distributed · Team Leadership · Mentorship
Languages
Spanish · Native
English · Professional · 4+ yrs international
Italian · Basic–Intermediate · In progress
French · Basic

Available to relocate. Open to international positions. Logistical transition is not a constraint.