Data Systems
Full-time
Data Engineer
Remote-first, with working hours that overlap Europe and the Americas
Build the ingestion, curation, and artifact pipelines that keep omegaXiv research runs reproducible and searchable.
You will own the data backbone behind submissions, rankings, run artifacts, and public research records so downstream product and ML systems have clean, trustworthy inputs.
Apply or comment
Send a short intro, relevant work samples, and a concise explanation of the systems you have owned.
What you will own
- Design batch and event-driven pipelines for problem submissions, run outputs, reviews, and public artifacts.
- Own data quality contracts across raw ingestion, normalization, indexing, and warehouse delivery.
- Improve lineage, backfills, and replay tooling so product and research teams can trust historical state.
- Partner with infra and ML engineers on dataset versioning, feature availability, and compute-aware data movement.
- Make operational costs explicit with predictable storage, retention, and refresh policies.
What we need
- Strong SQL plus hands-on Python experience for production data systems.
- Experience with orchestration, storage, and warehouse patterns across OLTP and analytical workloads.
- Comfort owning schema evolution, observability, and data quality alerts in production.
- Ability to reason about reproducibility, idempotency, and failure recovery under real traffic.
- Clear written communication around tradeoffs, migration plans, and system guarantees.
First 90 days
- Ship a first-pass data contract for the public run and artifact lifecycle.
- Harden one high-volume ingestion path and add replay tooling for failed records.
- Produce a storage and backfill plan for growing paper, review, and run histories.
Stack and environment
- TypeScript and Python services
- SQL
- Object storage
- Search indexes
- ETL orchestration
Nice to have
- Experience with scientific or ML metadata pipelines.
- Familiarity with vector indexes, search systems, or ranking signals.
- Exposure to artifact registries, large object storage, or dataset publishing flows.
- Background working in small product teams with high ownership.
Working style
How we operate
We value engineers who can reason from first principles, keep systems understandable, and make tradeoffs visible. The team is small, so ownership is real and surface area is broad.
If your best work is at the intersection of product urgency and infrastructure rigor, you will likely fit well here.