Job Summary
As a Mid-Level Data Engineer at Revolab, you will architect and maintain the high-throughput pipelines that power our Voice AI and NLP engines. You won't just move text: you will manage multimodal data (audio, streaming logs, and metadata), ensuring it is processed with low latency and high reliability. You will bridge the gap between our backend services and our ML team, turning raw interaction data into high-quality training sets and actionable insights.
Key Responsibilities
- Build and optimize automated pipelines for ingesting and transforming massive datasets from diverse sources (voice streams, gRPC logs, web scrapers).
- Develop robust workflows to preprocess and version unstructured data (audio and text), ensuring it is ready for LLM and speech-model training.
- Work closely with backend developers to integrate data pipelines with core application services, ensuring seamless data extraction from production databases and real-time event streams.
- Contribute to the design of our data lake and warehouse strategy, focusing on efficient retrieval patterns for machine learning.
- Monitor and optimize pipeline latency and resource usage within our Kubernetes environment.
- Implement validation checks and monitoring to ensure data integrity, security, and compliance.
- Partner with ML Engineers to deliver "AI-ready" datasets, moving beyond simple ETL to feature engineering and data versioning.
Qualifications
Education:
- Bachelor’s degree in Data Science, Computer Science, Software Engineering, or related discipline. Master’s degree is a plus.
Technical Skills:
- Proficiency in Python and SQL for data engineering tasks.
- Familiarity with Apache Airflow, Spark, or similar ETL orchestration tools.
- Experience with data lakes, databases, and cloud platforms (AWS preferred).
- Understanding of data pipeline architecture and design principles.
- Exposure to unstructured and multimodal data processing.
- Basic understanding of data security, privacy, and compliance requirements.
Soft Skills:
- Ability to deconstruct complex data bottlenecks into scalable engineering solutions.
- Understanding of how data flows through a distributed system, not just a single script.
- Eagerness to work in a fast-paced AI startup where data formats and model requirements evolve quickly.
- Ability to explain data constraints and trade-offs to stakeholders.
Preferred Experience:
- Prior work on data projects supporting machine learning or AI development.
- Experience with data from regulated environments (e.g., finance, healthcare).
- Web scraping and Robotic Process Automation (RPA) knowledge.