We are looking for talented individuals to join our team in 2027. As a graduate, you will get opportunities to pursue bold ideas, tackle complex challenges, and unlock limitless growth. Launch your career where inspiration is infinite at our Company.
Successful candidates must be able to commit to an onboarding date by the end of 2027. Please state your availability and graduation date clearly in your resume.
Team Introduction: Our Arch-Data Ecosystem team plays a crucial role in the data ecosystem of the TikTok Recommendation System, focusing on creating offline and real-time data storage solutions for large-scale recommendation, search, and advertising businesses, serving over 1 billion users. The core goals of the team are to ensure high system reliability, uninterrupted service, and smooth data processing. We are committed to building a storage and computing infrastructure that can adapt to various data sources and meet diverse storage requirements, ultimately providing efficient, cost-effective, and user-friendly data storage and management tools for the business.
Topic Content: Building a unified infrastructure that integrates the "training data base" and "training/inference state system" for multimodal foundation models in search, recommendation, and advertising scenarios. Through collaborative optimization of data lakes, caching, distributed computing, and GPU IO, we aim to reduce training and inference costs for foundation models while improving iteration efficiency.
Responsibilities:
- Design and implement real-time and offline data architecture for large-scale recommendation systems.
- Build scalable and high-performance streaming Lakehouse systems that power feature pipelines, model training, and real-time inference.
- Collaborate with ML platform teams to support PyTorch-based model training workflows and design efficient data formats and access patterns for large-scale samples and features.
- Own core components of our distributed storage and processing stack, from file format to stream compaction to metadata management.
Minimum Qualification(s):
- Individuals who are completing or have recently completed a PhD in Software Development, Computer Science, Computer Engineering, or a related technical discipline.
- Experience building large-scale distributed systems, preferably in storage, stream processing, or ML infrastructure.
- Understanding of Apache Flink internals, with hands-on experience in state management, connectors, or UDFs.
- Familiarity with modern Lakehouse technologies such as Apache Paimon, Iceberg, Delta Lake, or Hudi, especially around incremental ingestion, schema evolution, and snapshot isolation.
Preferred Qualification(s):
- Experience in designing and optimizing Flink + Paimon architectures for unified batch/stream processing.
- Familiarity with feature storage and training data pipelines, and their integration with PyTorch, especially for large-scale model training.
- Knowledge of columnar file formats (Parquet, ORC, Lance) and how they are used in feature engineering or ML data loading.
- Proficiency in Java/Scala/C++, and strong debugging/performance tuning ability.
- Previous experience in Lakehouse metadata management, compaction scheduling, or data versioning.
- Knowledge of legacy data stores like HBase/Kudu.