Job Summary
As a Data Engineer, you will play a crucial role in the foundation of our data-driven operations, developing and maintaining a robust data ecosystem. You will design and implement data architectures that handle large-scale, varied datasets from diverse sources such as text, images, audio, and video. Your responsibilities include building and optimizing automated pipelines for data collection, preprocessing, and storage, and ensuring the infrastructure supports both real-time and batch processing. You will also improve the performance and scalability of our data systems while maintaining stringent data security and compliance controls to protect sensitive information.
Key Responsibilities
- Data Collection: Gather publicly available multimodal data (text, image, audio, video) from multiple sources, including websites, podcasts, news, social media, and reports.
- Data Pipeline & Automation Development: Develop automated pipelines for data collection, processing, and storage (see the illustrative sketch after this list).
- Data Infrastructure: Design and implement scalable infrastructure that supports big-data workloads and robust data pipelines.
- Data Preprocessing: Preprocess and clean datasets by removing noise and ensuring quality.
- Exploratory Data Analysis: Conduct analysis and exploration of collected datasets to uncover insights.
- Data Warehouse and Data Lake Implementation: Establish robust, scalable data storage solutions.
- Collaboration: Partner with the data science and machine learning engineering teams to meet the data needs of ongoing model development.
- Data Quality Assurance: Ensure the accuracy, consistency, and reliability of data.
- Optimization: Improve data pipeline performance and scalability for fast and efficient data processing.
- Data Security and Compliance: Maintain data security measures and ensure compliance with regulations.
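For illustration only, the sketch below shows the shape of a collect-preprocess-store step such a pipeline might contain. The source URL, the cleaning rules, and the local SQLite target are hypothetical assumptions for the example, not a description of our actual stack.

```python
# Minimal, hypothetical collect -> preprocess -> store step.
# The URL and the local SQLite target are illustrative only.
import io
import sqlite3

import pandas as pd
import requests


def collect(url: str) -> pd.DataFrame:
    """Fetch a publicly available CSV over HTTP and load it into a DataFrame."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning: drop duplicates and empty rows, normalize column names."""
    df = df.drop_duplicates().dropna(how="all")
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df


def store(df: pd.DataFrame, db_path: str, table: str) -> None:
    """Persist the cleaned data to a local SQLite table (a stand-in for a warehouse)."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)


if __name__ == "__main__":
    raw = collect("https://example.com/public/dataset.csv")  # hypothetical source
    store(preprocess(raw), "pipeline.db", "example_dataset")
```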
General Qualifications
Bachelor's degree in Data Science, Artificial Intelligence, Software Engineering, or a related field. Mastery of the Python programming language.
Experience and Skills
- Technical Skills:
- Extensive hands-on experience in designing robust data infrastructure and managing large-scale, complex data pipelines.
- Proficiency in Robotic Process Automation (RPA) and web scraping techniques for efficient data extraction.
- Proven experience in managing various types of data storage solutions, including databases, data lakes, and data warehouses.
- Strong Python and SQL programming abilities, with familiarity in using Jupyter Notebook, Git, and relevant data engineering libraries such as Pandas and NumPy.
- Solid foundation in data security, privacy, and compliance, ensuring adherence to industry standards and regulations.
- In-depth knowledge of concurrency, parallelism, and scaling, with the ability to optimize for high performance and efficiency.
- Expertise in data preprocessing and cleaning to prepare datasets effectively for machine learning model training.
- Familiarity with AWS infrastructure and a range of data engineering tools such as Apache Spark, Airflow, and other ETL technologies (see the orchestration sketch following this section).
- Soft Skills:
- Excellent communication and teamwork abilities.
- Strong problem-solving capabilities and critical thinking.
- Adaptable, eager to learn, and committed to continuous personal and professional development.
- Preferred Experience:
- In-depth understanding of, and experience with, banking-industry regulations related to data security and compliance.
- Proven track record of managing large-scale data pipelines for large language and speech model training projects.
- Prior experience in data preprocessing and preparation for AI model training.
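As a pointer to the kind of orchestration experience listed above, here is a minimal Airflow DAG sketch (assuming Airflow 2.4 or later). The DAG id, schedule, and task bodies are placeholders rather than a description of our production pipelines.

```python
# Minimal, hypothetical Airflow DAG: a daily extract -> transform -> load chain.
# DAG id, schedule, and task logic are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw data from source systems")


def transform():
    print("clean and reshape the extracted data")


def load():
    print("write the transformed data to the warehouse")


with DAG(
    dag_id="example_daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```

In practice, each task would call the team's own extraction, transformation, and loading code rather than print placeholders.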
Work Environment
Office-based environment with potential for hybrid or remote work depending on company policy.
Reporting
This position reports to the Machine Learning Lead.