Junior Python Developers - US
DataForce by TransPerfect seeks experienced Python Developers to build and own scalable data pipelines for LLMs, transforming large datasets into clean, model-ready data and supporting AI training.
Work Location: Remote, within the US
Engagement Model: Freelancer/Independent Contractor
Start Date: ASAP
DataForce by TransPerfect is looking for skilled Python Developers to architect, build, and own the data pipelines that power large language model (LLM) development.
Your primary mission will be to build scalable, automated systems that transform massive raw datasets into clean, model-ready formats. While your focus will be on data engineering, your expertise will also be valuable in collaborating on model training runs and experiments.
You are a strong fit for this role if you are a Python expert who thrives on solving large-scale data challenges and enjoys working at the intersection of data engineering and machine learning.
Role Responsibilities
Design, develop, and own robust, scalable, and automated ETL/ELT pipelines in Python to ingest and process terabyte-scale text datasets.
Implement rigorous data cleaning, deduplication, filtering, and normalization strategies, and define and enforce data quality standards to ensure high integrity for model training.
Efficiently structure and format diverse datasets (e.g., as JSON or Parquet) for consumption by LLM training frameworks.
Work closely with AI researchers and ML engineers to understand data requirements, define metrics, and support the model training lifecycle.
Continuously optimize data processing workflows for performance, cost efficiency, and reliability.
Occasionally assist with launching and monitoring model training runs, and with debugging data-related issues that arise during them.
Role Requirements
1–5 years of professional experience in Python development, data engineering, data processing, or backend software engineering.
Bachelor’s degree in a technical field such as Software Engineering, Computer Science, Information Technology, Mechanical Engineering, or a related discipline.
Expert-level proficiency in Python and its data ecosystem (e.g., Pandas, NumPy, Dask, Polars).
Proven experience building and maintaining large-scale data pipelines.
Deep understanding of data structures, data modeling, and software engineering best practices (Git, CI/CD, testing).
Experience handling and parsing diverse data formats (JSON, CSV, XML, Parquet) at scale.
Excellent problem-solving skills and meticulous attention to detail.
Strong communication and collaboration skills, with experience working in a team environment.
Preferred Role Requirements
Hands-on experience with the data preprocessing pipeline for an LLM (e.g., LLaMA, BERT, GPT-family).
Experience with big data frameworks like Apache Spark or Ray.
Experience with Hugging Face libraries (Transformers, Datasets, Tokenizers).
Familiarity with ML frameworks like PyTorch or TensorFlow.
Proficiency with cloud platforms (AWS, GCP, Azure) and their data/storage services.
DataForce by TransPerfect is part of the TransPerfect family of companies, the world’s largest provider of language and technology solutions for global business, with offices in more than 100 cities worldwide.
We offer high-quality data for Human-Machine Interaction to some of the most prestigious technology companies in the world. Our department focuses on gathering, enriching, and processing data for Machine Learning in different AI domains. To learn more about DataForce, please visit us at https://www.transperfect.com/dataforce.
TransPerfect provides equal employment opportunity to all individuals regardless of their race, color, creed, religion, gender, age, sexual orientation, national origin, disability, veteran status, or any other characteristic protected by state, federal, or local law. For more information on the TransPerfect Family of Companies, please visit our website at www.transperfect.com.
Locations: Remote, United States
Remote status: Fully Remote
Employment type: Contract