Data Systems | Full-time

Data Engineer

Remote-first, with working-hours overlap across Europe and the Americas

Build the ingestion, curation, and artifact pipelines that keep omegaXiv research runs reproducible and searchable.

You will own the data backbone behind submissions, rankings, run artifacts, and public research records so downstream product and ML systems have clean, trustworthy inputs.

Apply or comment

Send a short intro, relevant work samples, and a concise explanation of the systems you have owned.


What you will own

Design batch and event-driven pipelines for problem submissions, run outputs, reviews, and public artifacts.
Own data quality contracts across raw ingestion, normalization, indexing, and warehouse delivery.
Improve lineage, backfills, and replay tooling so product and research teams can trust historical state.
Partner with infra and ML engineers on dataset versioning, feature availability, and compute-aware data movement.
Make operational costs explicit with predictable storage, retention, and refresh policies.

What we need

Strong SQL plus hands-on Python experience for production data systems.
Experience with orchestration, storage, and warehouse patterns across OLTP and analytical workloads.
Comfort owning schema evolution, observability, and data quality alerts in production.
Ability to reason about reproducibility, idempotency, and failure recovery under real traffic.
Clear written communication around tradeoffs, migration plans, and system guarantees.

First 90 days

Ship a first-pass data contract for the public run and artifact lifecycle.
Harden one high-volume ingestion path and add replay tooling for failed records.
Produce a storage and backfill plan for growing paper, review, and run histories.

Stack and environment

TypeScript and Python services
SQL
Object storage
Search indexes
ETL orchestration

Nice to have

Experience with scientific or ML metadata pipelines.
Familiarity with vector indexes, search systems, or ranking signals.
Exposure to artifact registries, large object storage, or dataset publishing flows.
Background working in small product teams with high ownership.

Working style

How we operate

We value engineers who can reason from first principles, keep systems understandable, and make tradeoffs visible. The team is small, so ownership is real and surface area is broad.

If your best work is at the intersection of product urgency and infrastructure rigor, you will likely fit well here.