Uber details Fiber, a framework for distributed AI model training

Uber details Fiber, a framework for distributed AI model trainingUber details Fiber, a framework for distributed AI model training

A preprint paper coauthored by Uber AI scientists and Jeff Clune, a research team leader at San Francisco startup OpenAI, describes Fiber, an AI development and distributed training platform for methods including reinforcement learning (which spurs AI agents to complete goals via rewards) and population-based learning. The team says that Fiber expands the accessibility of large-scale parallel computation without the need for specialized hardware or equipment, enabling non-experts to reap the benefits of genetic algorithms in which populations of agents evolve rather than individual members. Fiber — which was developed to power large-scale parallel scientific computation projects like POET — is available in open source as of this week, on Github. It supports Linux systems running Python 3.6 and up and Kubernetes running on public cloud environments like Google Cloud, and the research team says that it can scale to hundreds or even thousands of machines. As the researchers point out, increasing computation underlies many recent advances in machine learning, with more and more algorithms relying on distributed training for processing an enormous amount of data. (OpenAI Five, OpenAI’s Dota 2-playing bot, was trained on 256 graphics cards and 1280,000 processor cores on Google Cloud.) But reinforcement and population-based methods pose challenges for reliability, efficiency, and flexibility that some frameworks fall short of satisfying. Fiber addresses these challenges with a lightweight strategy to handle task scheduling. It leverages cluster management software for job scheduling and tracking, doesn’t require preallocating resources, and can dynamically scale up and down on the fly, allowing users to ...
More on: venturebeat.com