MLabs

Research Crawling Engineer

Location: Remote, PL
Work mode: Remote
Category: IT
IT category: Data Engineer
Published: April 28, 2026
Last checked: May 6, 2026

Location: Remote - must have a 6-hour overlap with EST

Remote | Full-time

Compensation: $150K - $225K

We are hiring on behalf of our client, a technical infrastructure firm specializing in the delivery of massive-scale web data to organizations developing advanced artificial intelligence models. The organization supports high-capacity bandwidth-sharing networks and operates a distributed crawler capable of accessing high-quality public web data at a global scale. The team has also engineered pipelines for the ingestion, segmentation, and annotation of billions of multimedia files, facilitating dataset creation for frontier research labs.

The organization operates as a lean, technical team that prioritizes speed and direct execution. As a Research Crawling Engineer, the successful candidate will design and operate large-scale web data acquisition systems. This role encompasses distributed systems, scraping infrastructure, and data pipelines, focusing on providing high-quality inputs for research and model development.

Key Responsibilities

  • Construct and maintain large-scale web crawlers across diverse domains.
  • Design high-throughput, fault-tolerant systems for data collection, managing volumes ranging from millions to billions of URLs per day.
  • Navigate anti-bot systems, rate limits, and dynamic, JavaScript-heavy websites.
  • Develop robust pipelines for data cleaning, deduplication, filtering, and normalization.
  • Build and maintain datasets specifically structured for research and machine learning model training.
  • Monitor and optimize crawl performance, coverage, and data quality through rapid iteration.
  • Collaborate with research teams to ensure data collection efforts align with modeling requirements.
  • Optimize infrastructure to ensure cost-efficiency, low latency, and reliability.
