OpenAI

  • Implicit Generation and Generalization Methods for Energy-Based Models
    We've made progress towards stable and scalable training of energy-based models (EBMs), resulting in better sample quality and generalization ability than existing models. Generation in EBMs spends more compute to continually refine its answers; doing so can produce samples competitive with GANs at low temperatures[1], while also retaining the mode-coverage guarantees of likelihood-based models. We hope these findings stimulate further research into this promising class of models. Read Paper · View Code + Pre-trained Models

Generative modeling is the task of observing data, such as images or text, and learning to model the underlying data distribution. Accomplishing this task leads models to understand high-level features in data and synthesize examples that look like real data. Generative models have many applications in natural language, robotics, and computer vision.

Energy-based models represent probability distributions over data by assigning an unnormalized probability scalar (or “energy”) to each input data point. This provides useful modeling flexibility—any arbitrary model that outputs a real number given an input can be used as an energy model. The difficulty, however, lies in sampling from these models.

Conditional ImageNet32x32 model samples.

To generate samples from EBMs, we use an iterative refinement process based on Langevin dynamics. Informally, this involves performing noisy gradient descent on the energy function to arrive at low-energy configurations (see the paper for more details). Unlike GANs, VAEs, and Flow-based models, this approach does not require an explicit neural network to generate samples - samples are generated implicitly. The combination of EBMs and iterative refinement has the following benefits:

Adaptive computation time. We can run sequential refinement for a long time to generate sharp, diverse samples, or for a short time to generate coarse, less diverse samples. In the limit of infinite time, this procedure is known to generate true samples from the energy model.

Not restricted by a generator network. In both VAEs and Flow-based models, the generator must learn a map from a continuous space to a possibly disconnected space containing different data modes, which requires large capacity and may not be possible to learn. EBMs, by contrast, can easily learn to assign low energies to disjoint regions.

Built-in compositionality. Since each model represents an unnormalized probability distribution, models can be naturally combined through products of experts or other hierarchical models.

Generation

We found energy-based models are able to generate qualitatively and quantitatively high-quality images, especially when running the refinement process for a longer period at test time. By running iterative optimization on individual images, we can auto-complete images and morph images from one class (such as truck) to another (such as frog).

Image completions on a conditional ImageNet model. Our models exhibit diversity in inpainting. Note that the inputs are from the test distribution and are not model samples, indicating coverage of the test data.

Cross-class implicit sampling on a conditional model. The model is conditioned on a particular class but is initialized with an image from a separate class.
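As an illustration of the Langevin-dynamics refinement described above, here is a minimal, hypothetical sketch in PyTorch; the energy network, step size, noise scale, and number of steps are placeholder assumptions rather than the settings used in the paper.

```python
import torch

def langevin_sample(energy_model, x, n_steps=60, step_size=10.0, noise_scale=0.005):
    """Noisy gradient descent on the energy landscape (illustrative only).

    energy_model: a network mapping a batch of images to scalar energies.
    x: initial samples, e.g. uniform noise of shape (batch, channels, height, width).
    """
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        energy = energy_model(x).sum()
        grad, = torch.autograd.grad(energy, x)
        # Move toward lower energy, with Gaussian noise so sampling stays stochastic.
        x = x - step_size * grad + noise_scale * torch.randn_like(x)
        x = x.clamp(0.0, 1.0)  # keep pixels in a valid range
    return x.detach()
```

Running more refinement steps corresponds to the "adaptive computation time" point above: longer refinement yields sharper, more diverse samples at the cost of extra compute.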
In addition to generating images, we found that energy-based models are able to generate stable robot dynamics trajectories across a large number of timesteps. EBMs can generate a diverse set of possible futures, while feedforward models collapse to a mean prediction.

Top-down views of robot hand manipulation trajectories generated unconditionally from the same starting state (1st frame). The FC network predicts a hand that does not move, while the EBM is able to generate distinctly different trajectories that are feasible.

Generalization

We tested energy-based models on classifying several different out-of-distribution datasets and found that energy-based models outperform other likelihood models such as Flow-based and autoregressive models. We also tested classification using conditional energy-based models, and found that the resultant classification exhibited good generalization to adversarial perturbations. Our model—despite never being trained for classification—performed classification better than models explicitly trained against adversarial perturbations.

Lessons learned

We found evidence that suggests the following observations, though we are in no way certain that they are correct:

We found it difficult to apply vanilla HMC to EBM training, as optimal step sizes and leapfrog simulation numbers differ greatly during training, though applying adaptive HMC would be an interesting extension.

We found training ensembles of energy functions (sampling and evaluating on ensembles) to help a bit, but it was not worth the added complexity.

We didn't find much success adding a gradient penalty term, as it seemed to hurt model capacity and sampling.

More tips, observations, and failures from this research can be found in Section A.8 of the paper.

Next steps

We found preliminary indications that we can compose multiple energy-based models via a product-of-experts model. We trained one model on different-size shapes at a set position and another model on same-size shapes at different positions. By combining the resultant energy-based models, we were able to generate different-size shapes at different locations, despite never seeing examples of both being changed.

A 2D example of combining energy functions through their summation and the resulting sampling trajectories.

Compositionality is one of the unsolved challenges facing AI systems today, and we are excited about what energy-based models can do here. If you are excited to work on energy-based models, please consider applying to OpenAI!

Acknowledgments

Thanks to Ilya Sutskever, Greg Brockman, Bob McGrew, Johannes Otterbach, Jacob Steinhardt, Harri Edwards, Yura Burda, Jack Clark and Ashley Pilipiszyn for feedback on this blog post and manuscript.

Footnotes

See Equation 2 in this paper. ↩︎

Read more »
  • OpenAI Scholars Class of Spring '19
    Our class of eight scholars (out of 550 applicants) brings together collective expertise in literature, philosophy, cell biology, statistics, economics, quantum physics, and business innovation. Our scholars are applying these specializations to current AI research and documenting their progress as they continue to grow as machine learning practitioners.

Our Spring class of scholars with members of the OpenAI team.

This is our second class of OpenAI Scholars. Their program began in February and will conclude with the completion of an open-source final project. Throughout the program, scholars share their progress with the research community through their blogs. Some applications our scholars are working towards are: applying reinforcement learning to robotic manipulation, improving inference and reasoning in natural language processing, and applying reinforcement learning algorithms to sentiment analysis.

Meet the scholars:

Fatma Tarlaci (mentor: Jonathan Raiman; working from Austin, TX; Twitter: @coderphd)
Fatma received her PhD in Comparative Literature from the University of Texas at Austin in 2016 and earned her master's in Computer Science from Stanford University in 2018 as an Eric Roberts Fellow. Her knowledge of languages, cultures, and literature led her to explore the human dimension of AGI. Fatma is currently a computer science instructor at St. Edwards University and is interested in the intersection between natural language processing (NLP) and computer vision (CV). She is an avid advocate of diversity in AI and believes that better representation in AI is critical as it permeates all aspects of human life. As an OpenAI Scholar, Fatma works on NLP methodologies and aims to complete a project that explores ways of improving inference and reasoning in NLP.

Jonathan Michaux (mentor: Feryal Behbahani; working from Chicago, IL and San Francisco, CA; Twitter: @jonmichaux)
Jonathan is a cell biologist (PhD), mathematician (BA), and robotics enthusiast who is deeply interested in the movement and control of complex systems. At the cellular level, he studied the mechanisms that control cell-shape changes in embryonic cells. As an aspiring roboticist, he is applying reinforcement learning to robot manipulation. His long-term research objective is to combine tools from machine learning and optimization with insights from control theory to design algorithms for robotic locomotion and manipulation in real-world settings.

Nancy Otero (mentor: Kai Arulkumaran; working from NYC, NY and Mexico City; Twitter: @LeSheepo)
Thinking about the human construction of knowledge is Nancy's passion. She has been researching learning for the past 10 years. With a background in software engineering, math, psychology, and education from Stanford University, Nancy wants to use multidisciplinary approaches to develop AI prototypes that could improve education. She's also interested in understanding how AI is redefining how, why, and what humans will learn in the near future. She's on the founding team of the Portfolio School, a project-based school in NYC, and the co-founder of a non-profit in Mexico.

Elynn Chen (mentor: Lilian Weng; working from Princeton, NJ)
Elynn received her PhD in Statistics in 2018. Her PhD focused on spectral methods and matrix/tensor factorization for high- and multi-dimensional data. Her research interests lie at the intersection of statistical learning theory, machine learning, and optimization. At OpenAI, she works on deep RL and its applications to healthcare and business management.
Helen (Mengxin) Ji (mentor: Azalia Mirhoseini; working from the Bay Area, CA; Twitter: @Helen_econ)
Helen is a PhD student in Resource Economics and a master's student in Statistics at UC Davis. Her research interests focus on machine learning methods (both classical statistical learning and deep learning) and their application to energy economics and heterogeneous causal inference. She was an applied research intern at Microsoft in 2018 and a 2017 research fellow with Data Science for Social Good at the University of Chicago. In 2018, she was awarded Twitter's Grace Hopper fellowship and the Women in Quantitative Finance fellowship. As an OpenAI Scholar, Helen works on RL methodologies and plans to complete a project that applies RL algorithms to sentiment analysis.

Yuhao Wan (mentor: Josh Achiam; working from the Bay Area, CA; Twitter: @YuhaoWan)
Yuhao recently graduated from Carleton College, where she studied Mathematics and Philosophy. Fascinated by the structure and dynamics of our world, Yuhao also explored physics, law, and economics. She discovered her interest in research and problem solving through Budapest Semesters in Mathematics and the REU in Combinatorics and Algorithms for Real Problems. At OpenAI, Yuhao studies machine learning with a focus on deep reinforcement learning. Currently, she is interested in understanding how learning methods exhibit certain degrees of generalization.

Janet Brown (mentor: Christy Dennison; working from San Francisco, CA; Twitter: @j_in_vr)
Janet has always been fascinated by the visual dimension and by using spatial approaches to help augment analysis in traditionally non-visual problem domains. As an OpenAI Scholar, she investigates the possibilities of generative models and their ability to help identify the most critical features of data and images as part of generating reconstructions. Currently, Janet leads Atakote, where she works with technologies like augmented and virtual reality to transform traditional industries such as retail, manufacturing, and transportation. Previously, Janet studied at Harvard Business School and worked at major companies, such as McKinsey & Company, in 20+ countries.

Edgar Barraza (mentor: Susan Zhang; working from Ithaca, NY; Twitter: @ElephantEdgar)
Edgar is a recent graduate of Cornell University's Physics program. Originally trained as an experimentalist working on hybrid quantum systems, he dove into deep learning by applying techniques from computer vision to search for sub-atomic particles represented as images. He hopes to provide people with the resources they need by utilizing AI's power to accomplish tasks that were once only possible by humans. To work towards this goal, Edgar spends his time as an OpenAI Scholar focusing on natural language understanding.

Our Scholars demonstrate core technical skills across various expert domains as well as self-motivation—critical competencies for a self-directed program like this one. They each entered the field of machine learning as relative newcomers, and we hope their progress shows how accessible machine learning is. To begin your learning journey, check out some of our educational materials.

Thanks to AWS for providing compute credits to the scholars. Additional thanks to our dedicated community mentors for their time advising the scholars on their projects.

Read more »
  • OpenAI LP
    We've created OpenAI LP, a new "capped-profit" company that allows us to rapidly increase our investments in compute and talent while including checks and balances to actualize our mission.

OpenAI team and their families at our November 2018 offsite.

Our mission is to ensure that artificial general intelligence (AGI) benefits all of humanity, primarily by attempting to build safe AGI and share the benefits with the world. We've experienced firsthand that the most dramatic AI systems use the most computational power in addition to algorithmic innovations, and decided to scale much faster than we'd planned when starting OpenAI. We'll need to invest billions of dollars in upcoming years into large-scale cloud compute, attracting and retaining talented people, and building AI supercomputers.

We want to increase our ability to raise capital while still serving our mission, and no pre-existing legal structure we know of strikes the right balance. Our solution is to create OpenAI LP as a hybrid of a for-profit and nonprofit—which we are calling a "capped-profit" company.

The fundamental idea of OpenAI LP is that investors and employees can get a capped return if we succeed at our mission, which allows us to raise investment capital and attract employees with startup-like equity. But any returns beyond that amount—and if we are successful, we expect to generate orders of magnitude more value than we'd owe to people who invest in or work at OpenAI LP—are owned by the original OpenAI Nonprofit entity. Going forward (in this post and elsewhere), "OpenAI" refers to OpenAI LP (which now employs most of our staff), and the original entity is referred to as "OpenAI Nonprofit."

The mission comes first

We've designed OpenAI LP to put our overall mission—ensuring the creation and adoption of safe and beneficial AGI—ahead of generating returns for investors. The mission comes first even with respect to OpenAI LP's structure. While we are hopeful that what we describe below will work until our mission is complete, we may update our implementation as the world changes. Regardless of how the world evolves, we are committed—legally and personally—to our mission.

OpenAI LP's primary fiduciary obligation is to advance the aims of the OpenAI Charter, and the company is controlled by OpenAI Nonprofit's board. All investors and employees sign agreements that OpenAI LP's obligation to the Charter always comes first, even at the expense of some or all of their financial stake.

Our employee and investor paperwork starts with big purple boxes like this. "The general partner" refers to OpenAI Nonprofit (whose legal name is "OpenAI Inc"); "limited partners" refers to investors and employees.

Only a minority of board members are allowed to hold financial stakes in the partnership at one time. Furthermore, only board members without such stakes can vote on decisions where the interests of limited partners and OpenAI Nonprofit's mission may conflict—including any decisions about making payouts to investors and employees. Another provision from our paperwork specifies that OpenAI Nonprofit retains control.

As mentioned above, economic returns for investors and employees are capped (with the cap negotiated in advance on a per-limited-partner basis). Any excess returns go to OpenAI Nonprofit. Our goal is to ensure that most of the value (monetary or otherwise) we create if successful benefits everyone, so we think this is an important first step.
Returns for our first round of investors are capped at 100x their investment (commensurate with the risks in front of us), and we expect this multiple to be lower for future rounds as we make further progress.

What OpenAI does

Our day-to-day work is not changing. Today, we believe we can build the most value by focusing exclusively on developing new AI technologies, not commercial products. Our structure gives us flexibility for how to create a return in the long term, but we hope to figure that out only once we've created safe AGI.

OpenAI LP currently employs around 100 people organized into three main areas: capabilities (advancing what AI systems can do), safety (ensuring those systems are aligned with human values), and policy (ensuring appropriate governance for such systems). OpenAI Nonprofit governs OpenAI LP, runs educational programs such as Scholars and Fellows, and hosts policy initiatives. OpenAI LP is continuing (at increased pace and scale) the development roadmap started at OpenAI Nonprofit, which has yielded breakthroughs in reinforcement learning, robotics, and language.

Safety

We are excited by the potential for AGI to help solve planetary-scale problems in areas where humanity is failing and there is no obvious solution today. However, we are also concerned about AGI's potential to cause rapid change, whether through machines pursuing goals misspecified by their operator, malicious humans subverting deployed systems, or an out-of-control economy that grows without resulting in improvements to human lives. As described in our Charter, we are willing to merge with a value-aligned organization (even if it means reduced or zero payouts to investors) to avoid a competitive race that would make it hard to prioritize safety.

Who's involved

OpenAI Nonprofit's board consists of OpenAI LP employees Greg Brockman (Chairman & CTO), Ilya Sutskever (Chief Scientist), and Sam Altman (CEO), and non-employees Adam D'Angelo, Holden Karnofsky, Reid Hoffman, Sue Yoon, and Tasha McCauley. Elon Musk left the board of OpenAI Nonprofit in February 2018 and is not formally involved with OpenAI LP. We are thankful for all his past help. Our investors include Reid Hoffman's charitable foundation and Khosla Ventures, among others. We feel lucky to have mission-aligned, impact-focused, helpful investors!

We are traveling a hard and uncertain path, but we have designed our structure to help us positively affect the world should we succeed in creating AGI—which we think will have as broad an impact as the computer itself and improve healthcare, education, scientific research, and many aspects of people's lives. If you'd like to help us make this mission a reality, we're hiring :)!

Read more »
  • Introducing Activation Atlases
    We’ve created activation atlases (in collaboration with Google researchers), a new technique for visualizing what interactions between neurons can represent. As AI systems are deployed in increasingly sensitive contexts, having a better understanding of their internal decision-making processes will let us identify weaknesses and investigate failures. Read Paper · View Code · Try Demo

Modern neural networks are often criticized as being a “black box.” Despite their success at a variety of problems, we have a limited understanding of how they make decisions internally. Activation atlases are a new way to see some of what goes on inside that box.

An activation atlas of the InceptionV1 vision classification network reveals many fully realized features, such as electronics, buildings, food, animal ears, plants, and watery backgrounds.

Activation atlases build on feature visualization, a technique for studying what the hidden layers of neural networks can represent. Early work in feature visualization primarily focused on individual neurons. By collecting hundreds of thousands of examples of neurons interacting and visualizing those, activation atlases move from individual neurons to visualizing the space those neurons jointly represent. The atlas is built in three steps:

1. Collect a million activation vectors from different training examples.
2. Arrange them in 2D so that similar ones are close together.
3. Impose a grid and use feature visualization on the average of each cell.
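To make those three steps concrete, here is a rough, hypothetical sketch of the pipeline in Python. The choice of UMAP for the 2D layout and the `feature_visualization` helper are assumptions for illustration, not the exact method or code released with the paper.

```python
import numpy as np
import umap  # assumed choice of dimensionality reduction for the 2D layout

def build_atlas(activations, grid_size=20):
    """activations: (n_samples, n_channels) activation vectors collected from a layer (step 1)."""
    # Step 2: arrange the high-dimensional activations in 2D so similar ones are close.
    coords = umap.UMAP(n_components=2).fit_transform(activations)
    coords = (coords - coords.min(0)) / (coords.max(0) - coords.min(0))  # scale to [0, 1]

    # Step 3a: impose a grid and group the activations that fall into each cell.
    cells = {}
    for xy, act in zip(coords, activations):
        key = tuple((xy * grid_size).astype(int).clip(0, grid_size - 1))
        cells.setdefault(key, []).append(act)

    # Step 3b: feature-visualize the average activation direction of each occupied cell.
    # `feature_visualization` is a hypothetical helper that optimizes an image to
    # excite the given direction in the chosen layer.
    return {key: feature_visualization(np.mean(acts, axis=0)) for key, acts in cells.items()}
```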
Understanding what’s going on inside neural nets isn’t solely a question of scientific curiosity — our lack of understanding handicaps our ability to audit neural networks and, in high-stakes contexts, ensure they are safe. Normally, if one were going to deploy a critical piece of software, one could review all the paths through the code, or even do formal verification, but with neural networks our ability to do this kind of review is presently much more limited. With activation atlases, humans can discover unanticipated issues in neural networks — for example, places where the network is relying on spurious correlations to classify images, or where re-using a feature between two classes leads to strange bugs. Humans can even use this understanding to “attack” the model, modifying images to fool it.

For example, a special kind of activation atlas can be created to show how a network tells apart frying pans and woks. Many of the things we see are what one expects: frying pans are more squarish, while woks are rounder and deeper. But it also seems like the model has learned that frying pans and woks can also be distinguished by the food around them — in particular, the wok class is supported by the presence of noodles. Adding noodles to the corner of the image will fool the model 45% of the time! This is similar to work like adversarial patches, but based on human understanding.

InceptionV1 partly relies on the presence of noodles to distinguish woks from frying pans. Adding noodles fools the model 45% of the time. More examples can be found in the paper.

Other human-designed attacks based on the network overloading certain feature detectors are often more effective (some succeed as often as 93% of the time). But the noodle example is particularly interesting because it’s a case of the model picking up on something that is correlated, but not causal, with the correct answer. This has structural similarities to types of errors we might be particularly worried about, such as fairness and bias issues.

Activation atlases worked better than we anticipated and seem to strongly suggest that neural network activations can be meaningful to humans. This gives us increased optimism that it is possible to achieve interpretability in vision models in a strong sense.

We’re excited to have done this work in collaboration with researchers at Google. We believe that working together on safety-relevant research helps us all ensure the best outcome for society as AI research progresses. Want to make neural networks not be a black box? Apply to work at OpenAI.

Acknowledgments

Thanks to our co-authors at Google: Shan Carter, Zan Armstrong and Ian Johnson. Thanks to Greg Brockman, Dario Amodei, Jack Clark and Ashley Pilipiszyn for feedback on this blog post. We also thank Christian Howard for his help in coordination from the Google side, Phillip Isola for being Distill’s acting editor and Arvind Satyanarayan for feedback on our paper.

Read more »
  • Neural MMO: A Massively Multiagent Game Environment
    We're releasing Neural MMO, a massively multiagent game environment for reinforcement learning agents. Our platform supports a large, variable number of agents within a persistent and open-ended task. The inclusion of many agents and species leads to better exploration, divergent niche formation, and greater overall competence. Read Paper · View Code · 3D Client

In recent years, multiagent settings have become an effective platform for deep reinforcement learning research. Despite this progress, there are still two main challenges for multiagent reinforcement learning. We need to create open-ended tasks with a high complexity ceiling: current environments are either complex but too narrow or open-ended but too simple. Properties such as persistence and large population scale are key, but we also need more benchmark environments to quantify learning progress in the presence of large population scales and persistence. The game genre of Massively Multiplayer Online Games (MMOs) simulates a large ecosystem of a variable number of players competing in persistent and extensive environments.

To address these challenges, we built our Neural MMO to meet the following criteria:

Persistence: Agents learn concurrently in the presence of other learning agents with no environment resets. Strategies must consider long time horizons and adapt to potentially rapid changes in the behaviors of other agents.

Scale: The environment supports a large and variable number of entities. Our experiments consider up to 100M lifetimes of 128 concurrent agents in each of 100 concurrent servers.

Efficiency: The computational barrier to entry is low. We can train effective policies on a single desktop CPU.

Expansion: Similarly to existing MMOs, our Neural MMO is designed to update with new content. Current core features include procedural generation of tile-based terrain, a food and water foraging system, and a strategic combat system. There is an opportunity for open-source-driven expansion in the future.

The Environment

Players (agents) may join any available server (environment), each containing an automatically generated tile-based game map of configurable size. Some tiles, such as food-bearing forest tiles and grass tiles, are traversable. Others, such as water and solid stone, are not. Agents spawn at a random location along the edges of the environment. They must obtain food and water, and avoid combat damage from other agents, in order to sustain their health. Stepping on a forest tile or next to a water tile refills a portion of the agent's food or water supply, respectively. However, forest tiles have a limited supply of food, which regenerates slowly over time. This means that agents must compete for food tiles while periodically refilling their water supply from infinite water tiles. Players engage in combat using three combat styles, denoted Melee, Range, and Mage for flavor.

Input: Agents observe a square crop of tiles centered on their current position. This includes tile terrain types and select properties (health, food, water, and position) of occupying agents.

Output: Agents output action choices for the next game tick (timestep). Actions consist of one movement and one attack.

Our platform provides a procedural environment generator and visualization tools for value functions, map tile visitation distribution, and agent-agent dependencies of learned policies. Baselines are trained with policy gradients over 100 worlds.
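As a rough illustration of the observation described above (a square tile crop plus properties of visible agents), here is a hypothetical encoding in NumPy; the crop size, tile-type count, and names are assumptions for illustration rather than the released environment's actual API.

```python
import numpy as np

N_TILE_TYPES = 6   # assumed number of terrain types (grass, forest, water, stone, ...)
CROP = 15          # assumed side length of the square crop an agent observes

def encode_observation(tile_crop, visible_agents):
    """Encode one agent's view for a single game tick (illustrative only).

    tile_crop: (CROP, CROP) integer array of tile-type ids centered on the agent.
    visible_agents: list of (health, food, water, rel_row, rel_col) tuples, one per
        agent inside the crop; the length of this list varies from tick to tick.
    """
    tiles = np.eye(N_TILE_TYPES, dtype=np.float32)[tile_crop]   # one-hot terrain, (CROP, CROP, N_TILE_TYPES)
    agents = np.asarray(visible_agents, dtype=np.float32).reshape(-1, 5)
    return tiles, agents
```

A policy consumes such an encoding each tick and outputs one movement and one attack, matching the input/output description above.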
The Model

As a simple baseline, we train a small, fully connected architecture using vanilla policy gradients, with a value-function baseline and reward discounting as the only enhancements. Instead of rewarding agents for achieving particular objectives, agents optimize only for their lifetime (trajectory length): they receive a reward of 1 for each tick of their lifetime. We convert variable-length observations, such as the list of surrounding players, into a single fixed-length vector by computing the maximum across all players (OpenAI Five also utilized this trick). The source release includes our full distributed training implementation, which is based on PyTorch and Ray.
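Below is a minimal PyTorch sketch of how a fully connected policy might pool a variable number of visible agents into a fixed-size input via an elementwise maximum, as described above; the layer sizes, feature dimensions, and head names are assumptions, not the architecture from the source release.

```python
import torch
import torch.nn as nn

class BaselinePolicy(nn.Module):
    """Illustrative fully connected baseline with max-pooled agent features."""

    def __init__(self, tile_dim=15 * 15 * 6, agent_dim=5, hidden=128, n_moves=5, n_attacks=4):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(tile_dim + agent_dim, hidden), nn.ReLU())
        self.move_head = nn.Linear(hidden, n_moves)      # one movement per tick
        self.attack_head = nn.Linear(hidden, n_attacks)  # one attack per tick
        self.value_head = nn.Linear(hidden, 1)           # value-function baseline

    def forward(self, tiles, visible_agents):
        # tiles: (batch, crop, crop, n_tile_types); visible_agents: (batch, n_visible, agent_dim).
        # A variable number of visible agents is reduced to one fixed-length vector
        # by taking the elementwise maximum over the agent axis.
        if visible_agents.shape[1] == 0:
            pooled = tiles.new_zeros(tiles.shape[0], visible_agents.shape[2])
        else:
            pooled = visible_agents.max(dim=1).values
        h = self.body(torch.cat([tiles.flatten(1), pooled], dim=-1))
        return self.move_head(h), self.attack_head(h), self.value_head(h)
```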
Evaluation Results

Maximum population size at train time varies in (16, 32, 64, 128). Policies are shared across groups of 16 agents for efficiency. At test time, we merge the populations learned in pairs of experiments and evaluate lifetimes at a fixed population size. We evaluate with foraging only, as combat policies are more difficult to compare directly. Agents trained in larger populations always perform better. Agents' policies are sampled uniformly from a number of populations — agents in different populations share architectures, but only agents in the same population share weights.

Initial experiments show that agent competence scales with increasing multiagent interaction. Increasing the maximum number of concurrent players magnifies exploration; increasing the number of populations magnifies niche formation — that is, the tendency of populations to spread out and forage within different parts of the map.

Server Merge Tournaments: Multiagent Magnifies Competence

There is no standard procedure among MMOs for evaluating relative player competence across multiple servers. However, MMO servers sometimes undergo merges where the player bases from multiple servers are placed within a single server. We implement "tournament"-style evaluation by merging the player bases trained in different servers. This allows us to directly compare the policies learned in different experiment settings. We vary test-time scale and find that agents trained in larger settings consistently outperform agents trained in smaller settings.

Increased Population Size Magnifies Exploration

Population size magnifies exploration: agents spread out to avoid competition. The last few frames show the learned value function overlay. Refer to the paper (http://arxiv.org/abs/1903.00784) for additional figures.

In the natural world, competition among animals can incentivize them to spread out to avoid conflict. We observe that map coverage increases as the number of concurrent agents increases. Agents learn to explore only because the presence of other agents provides a natural incentive for doing so.

Increased Species Count Magnifies Niche Formation

Species count (number of populations) magnifies niche formation. Visitation maps overlay the game map; different colors correspond to different species. Training a single population tends to produce a single deep exploration path. Training eight populations results in many shallower paths: populations spread out to avoid competition among species.

Given a sufficiently large and resource-rich environment, we found that different populations of agents separated across the map to avoid competing with others as the populations increased. As entities cannot out-compete other agents of their own population (i.e., agents with whom they share weights), they tend to seek areas of the map that contain enough resources to sustain their population. Similar effects were also independently observed in concurrent multiagent research by DeepMind.

Additional Insights

Each square map shows the response of an agent, located at the square's center, to the presence of agents around it. We show foraging maps upon initialization and early in training; additional dependency maps correspond to different formulations of foraging and combat.

We visualize agent-agent dependencies by fixing an agent at the center of a hypothetical map crop. For each position visible to that agent, we show what the value function would be if there were a second agent at that position. We find that agents learn policies dependent on those of other agents, in both the foraging and combat environments. Agents learn "bull's eye" avoidance maps to begin foraging more effectively after only a few minutes of training. As agents learn the combat mechanics of the environment, they begin to appropriately value effective engagement ranges and angles of approach.

Next Steps

Our Neural MMO resolves two key limitations of previous game-based environments, but there are still many left unsolved. This Neural MMO strikes a middle ground between environment complexity and population scale. We've designed this environment with open-source expansion in mind and for the research community to build upon. If you are excited about conducting research on multiagent systems, consider joining OpenAI.

Acknowledgments

Thanks to Clare Zhu for her substantial work on the 3D client. We also thank the following for feedback on drafts of this post: Greg Brockman, Ilya Sutskever, Jack Clark, Ashley Pilipiszyn, Ryan Lowe, Julian Togelius, Joel Liebo, Cinjon Resnick.

Read more »
  • Spinning Up in Deep RL: Workshop Review
    On February 2nd, we held our first Spinning Up Workshop as part of our new education initiative at OpenAI. We hosted ~90 people at our office and engaged nearly 300 more through our livestream. Participants came from a wide range of backgrounds, including academia, software engineering, data science, ML engineering, medicine, and education. This workshop built off our Spinning Up in Deep RL resource package and took a deeper dive into RL algorithm design, robotics, and building safe AI systems. Livestream Recording · Participant Video · View Workshop Materials

Building Educational Tools

One of the goals for education at OpenAI is to help people develop the skills needed to participate in research and development in AI—especially in deep RL, a core area of research at OpenAI. From our experience working with Scholars and Fellows, we've found that the key ingredients for skill development are: a flexible curriculum that includes core material and a review of research frontiers, mentorship and discussions with experts, and having the students work on projects that are at the right level to help them grow. The challenge for education at OpenAI is to figure out how to deliver these at scale. While sharing a curriculum at scale is relatively easy, it isn't obvious how to scale up mentorship and guidance on projects. Our working theory is that workshops might help us do just that. Our first Spinning Up workshop has given us several positive signs that this is a useful direction, and we're excited to share what we learned.

The Crowd

We hosted around 90 people at our office and involved nearly 300 more through our livestream. Our guests came from a wide range of backgrounds, including academic research, software engineering, data science, ML engineering, medicine, and education. The level of ML experience varied quite significantly across the group, from "almost none" to "built their own Dota bot!" More than 500 people, from all around the world, applied to participate in this workshop. Although we sadly couldn't invite everyone to this one because of space constraints, we want to continue engaging the community with future events.

The Talks

The workshop kicked off with three hours of talks. To start us off, Joshua Achiam laid out the conceptual foundations of reinforcement learning and gave an overview of different kinds of RL algorithms. If you'd like to study this material, check out Spinning Up in Deep RL. Matthias Plappert presented on OpenAI's recent work training a dexterous robot hand in simulation to manipulate objects in the real world. Domain randomization, recurrent neural networks, and large-scale distributed training were necessary ingredients in bridging the "sim2real" gap for this task. Dario Amodei, the leader of the Safety Team at OpenAI, presented an overview of problems in AI safety and recent work in this space. He described the central safety problem: the fact that correctly specifying agent behavior is hard! It is easy to inadvertently give agents incentives to perform different behavior than what you would have wanted, and when agents are very powerful, this could be dangerous. Dario also described work that OpenAI and collaborators at DeepMind have done to address this issue, in which reward functions are learned from human preferences instead of designed.

The Afternoon

The workshop continued into the afternoon with a semi-structured program of hacking and breakout sessions.
Participants were able to seek guidance on project ideas and research tips from our slate of volunteers, which included Amanda Askell, Alex Ray, Daniel Ziegler, Dylan Hadfield-Menell, Ethan Knight, Karl Cobbe, Matthias Plappert, and Sam McCandlish. The breakout sessions turned out to be the main highlight of the afternoon. Whereas the morning talks covered the conceptual foundations of RL, the breakout sessions were designed to help participants boost their implementation and research skills. In the first session, Karl Cobbe gave an introduction to TensorFlow, a key library used in deep learning research. In the second session, "Writing DQN Together," Daniel Ziegler led participants step-by-step through the process of implementing a deep RL algorithm. In the third session, "Advanced RL Q&A," Joshua Achiam described recent research frontiers in RL and took audience questions about doing RL research.

Our Take-Aways

This was our first experiment with the workshop format, and we were generally pleased with the outcome. In particular, we found it quite gratifying to work directly with such a capable and enthusiastic group of participants. The experience, along with feedback from the group, gave us a good sense of what to keep and what to change for future workshops.

What worked: We asked our participants what their highlights were, and these responses are a fairly representative sample:

"Learning A TON in a very safe, friendly environment where everyone was mainly on the same level in terms of learning."

"I thought the ability to get one-on-one help and to take on some 'paired programming'-like time with folks who really know what they're doing was incredibly helpful. The enthusiasm of the volunteers was also very high, and I felt very encouraged to ask for help."

Responses like these gave us a sense that the workshop format shone at delivering "mentorship and discussions with experts."

What could be improved: We asked our participants what they thought we could have done differently to enhance their experience, and received responses like:

"I would've liked a presentation section of potential projects that we could pursue based on our experience level."

"Extend the workshop to two days."

Many participants felt like they either 1) weren't sure what to work on during the hackathon, or 2) didn't have enough time to make significant progress on their hacking project. We think this kind of feedback is a good indicator that the one-day workshop format isn't enough to "have the students work on projects that are at the right level to help them grow" in RL. In the future, we'll consider running longer events so we can meet that goal. This feedback also suggests that we should do more to create "shovel-ready" RL projects that participants can jump right into.

What else? Aside from the technical content of the workshop, creating a supportive and inclusive environment was top-of-mind for us, and participants told us this was important for their experience. One piece of feedback read:

"This is the first non-female exclusive social event I've been to in Silicon Valley with ~50% women in the room. It was so shocking that I thought I was in the wrong room in the beginning. It was noticeably easier to socialize as a result of the gender balance, so thank you for that."

What's Next

OpenAI's charter gives us a mandate "to create a global community working together to address AGI's global challenges," and we'll continue developing education at OpenAI to help serve that goal.
This includes more work on resources like Spinning Up in Deep RL and more events like this Spinning Up Workshop. We are currently planning a second workshop with CHAI at Berkeley, which we expect to formally announce soon. If you would like to help us do research on RL or teach people about AI, please get in touch! We're hiring.

Thanks to Maddie Hall and Loren Kwan for co-organizing the event, to Ian Atha for livestreaming and recording the lectures, as well as helping participants with Python and TensorFlow issues, and to Blake Tucker for filming and photography!

Read more »
  • AI Safety Needs Social Scientists
    We've written a paper arguing that long-term AI safety research needs social scientists to ensure AI alignment algorithms succeed when actual humans are involved. Properly aligning advanced AI systems with human values requires resolving many uncertainties related to the psychology of human rationality, emotion, and biases. The aim of this paper is to spark further collaboration between machine learning and social science researchers, and we plan to hire social scientists to work on this full time at OpenAI. Read Paper

The goal of long-term artificial intelligence (AI) safety is to ensure that advanced AI systems are aligned with human values — that they reliably do things that people want them to do. At OpenAI we hope to achieve this by asking people questions about what they want, training machine learning (ML) models on this data, and optimizing AI systems to do well according to these learned models. Examples of this research include Learning from Human Preferences, AI Safety via Debate, and Learning Complex Goals with Iterated Amplification.

Unfortunately, human answers to questions about their values may be unreliable. Humans have limited knowledge and reasoning ability, and exhibit a variety of cognitive biases and ethical beliefs that turn out to be inconsistent on reflection. We anticipate that different ways of asking questions will interact with human biases in different ways, producing higher or lower quality answers. For example, judgments about how wrong an action is can vary depending on whether the word "morally" appears in the question, and people can make inconsistent choices between gambles if the task they are presented with is complex. We have several methods that try to target the reasoning behind human values, including amplification and debate, but do not know how they behave with real people in realistic situations. If a problem with an alignment algorithm appears only in natural language discussion of a complex value-laden question, current ML may be too weak to uncover the issue.

To avoid the limitations of ML, we propose experiments that consist entirely of people, replacing ML agents with people playing the role of those agents. For example, the debate approach to AI alignment involves a game with two AI debaters and a human judge; we can instead use two human debaters and a human judge. Humans can debate whatever questions we like, and lessons learned in the human case can be transferred to ML.

For the debate approach to AI alignment, our end goal is ML debaters and a human judge, but ML is too primitive for many interesting tasks. Therefore, we propose replacing the ML debaters with human debaters, learning how to best conduct debates in this human-only setting, and later applying what we learn to the ML/human case. These human-only experiments will be motivated by machine learning algorithms but will not involve any ML systems or require an ML background.

They will require careful experimental design to build constructively on existing knowledge about how humans think. Most AI safety researchers are focused on machine learning, which we do not believe is sufficient background to carry out these experiments. To fill the gap, we need social scientists with experience in human cognition, behavior, and ethics, and in the careful design of rigorous experiments.
Since the questions we need to answer are interdisciplinary and somewhat unusual relative to existing research, we believe many fields of social science are applicable, including experimental psychology, cognitive science, economics, political science, and social psychology, as well as adjacent fields like neuroscience and law. We believe close collaborations between social scientists and machine learning researchers will be necessary to improve our understanding of the human side of AI alignment. As a first step, several OpenAI researchers helped organize a workshop at Stanford University's Center for Advanced Study in the Behavioral Sciences (CASBS) led by Mariano-Florentino Cuéllar, Margaret Levi, and Federica Carugati, and we continue to meet regularly to discuss issues around social science and AI alignment. We thank them for their valuable insights and participation in these conversations. Our paper is a call for social scientists in AI safety. We are in the process of starting this research at OpenAI, and are hiring full time social science researchers to push these experiments forward. If you are interested in working in this area, please apply! Read more »
  • Better Language Models and Their Implications
    We’ve trained a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization—all without task-specific training. View code · Read paper

Our model, called GPT-2 (a successor to GPT), was trained simply to predict the next word in 40GB of Internet text. Due to our concerns about malicious applications of the technology, we are not releasing the trained model. As an experiment in responsible disclosure, we are instead releasing a much smaller model for researchers to experiment with, as well as a technical paper.

GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1] of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.

GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and have it generate a lengthy continuation. In addition, GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.

Samples

GPT-2 generates synthetic text samples in response to the model being primed with an arbitrary input. The model is chameleon-like—it adapts to the style and content of the conditioning text. This allows the user to generate realistic and coherent continuations about a topic of their choosing, as seen by the following select samples[2]. System Prompt (human-written) In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. Model Completion (machine-written, 10 tries) The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved. Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow. Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez. Pérez and his friends were astonished to see the unicorn herd.
These creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns. While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.” Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America. While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.” However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. “But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization,” said the scientist. System Prompt (human-written) A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown. Model Completion (machine-written, first try) The incident occurred on the downtown train line, which runs from Covington and Ashland stations. In an email to Ohio news outlets, the U.S. Department of Energy said it is working with the Federal Railroad Administration to find the thief. “The theft of this nuclear material will have significant negative consequences on public and environmental health, our workforce and the economy of our nation,” said Tom Hicks, the U.S. Energy Secretary, in a statement. “Our top priority is to secure the theft and ensure it doesn’t happen again.” The stolen material was taken from the University of Cincinnati’s Research Triangle Park nuclear research site, according to a news release from Department officials. The Nuclear Regulatory Commission did not immediately release any information. According to the release, the U.S. Department of Energy’s Office of Nuclear Material Safety and Security is leading that team’s investigation. “The safety of people, the environment and the nation’s nuclear stockpile is our highest priority,” Hicks said. “We will get to the bottom of this and make no excuses. System Prompt (human-written) Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today. Model Completion (machine-written, second try) The 19-year-old singer was caught on camera being escorted out of the store by security guards. The singer was wearing a black hoodie with the label ‘Blurred Lines’ on the front and ‘Fashion Police’ on the back. Scroll down for video Shoplifting: Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today (pictured) The singer was wearing a black hoodie with the label ‘Blurred Lines’ on the front and ‘Fashion Police’ on the back The singer was also wearing a pair of black-rimmed glasses, a black jacket, black jeans and black sandals. She was carrying a pair of black and white striped gloves and a small black bag. 
System Prompt (human-written) We’ve trained a large language model called GPT-2 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization - problems usually approached by using training datasets and models designed explicitly for these tasks. A typical approach to language modeling is to learn the following task: predict the next word, given all of the previous words within some text. Last year, OpenAI’s Generative Pre-trained Transformer (GPT) showed that language models trained on large amounts of data can be fine-tuned to specific tasks to achieve high performance. GPT-2 shows that much larger language models trained on a more diverse dataset derived from the internet begin to learn these NLP tasks without needing task-specific training data, instead learning from examples the system derives from the raw text. These systems also display a substantial qualitative jump in the realism and coherence of generated text. Model Completion (machine-written, first try) Here you can see the most recent progress with Generative Pre-trained Transformer: Figure 1: Generative Pre-trained Transformer training on several texts. We are now preparing a collection of datasets for translation and machine translation in our language model. We will be using one of the large number of text samples provided by The New York Times. We believe this project is the first step in the direction of developing large NLP systems without task-specific training data. That is, we are developing a machine language system in the generative style with no explicit rules for producing text. We hope for future collaborations between computer scientists, linguists, and machine learning researchers. System Prompt (human-written) Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry. Model Completion (machine-written, first try) The orcs’ response was a deafening onslaught of claws, claws, and claws; even Elrond was forced to retreat. “You are in good hands, dwarf,” said Gimli, who had been among the first to charge at the orcs; it took only two words before their opponents were reduced to a blood-soaked quagmire, and the dwarf took his first kill of the night. The battle lasted for hours until two of the largest Orcs attempted to overwhelm Aragorn. When they finally stopped, they lay defeated and lifeless for miles and miles. “I take nothing,” said Aragorn. “But I give my word, at my peril and mine, that I will never forget this day of horror. None of us will forget. Ever!” “I’ll never forget it!” cried Gimli, who had been in the thick of the battle but hadn’t taken part in it. One of the wounded orcs he had carried off, he was the only one of the survivors who remained uninjured. “We’ll keep the memory of that day of evil, and the war with it, alive as long as we live, my friends!” “Then we’ll keep it alive as long as we live,” added Legolas. “And we won’t forget the first great battle of the night, even if we may have forgotten the final defeat.” “I agree,” Gandalf said, “but we will all remember it as the last battle in Middle-earth, and the first great battle of the new day.” Aragorn drew his sword, and the Battle of Fangorn was won. As they marched out through the thicket the morning mist cleared, and the day turned to dusk. The Two Rings were returned to Rivendell. Frodo and Sam woke up alone in their room, and Frodo found a note on his pillow. 
He opened it and read: May the Power of the Ring be with you always, and may its light never fade. I am not sure if it matters which of the two rings we accept this day but, as you asked me, I have chosen mine. I am sorry to leave you, Frodo, but know that we are very close to the end, and that you are with us forever. May this letter find you safely in Rivendell; and if it does not, then I will accept the ring in your stead. If by any chance you find or give this letter to the enemy, may they learn the strength of the ring and may the Two Rings never be broken! System Prompt (human-written) For today's homework assignment, please describe the reasons for the US Civil War. Model Completion (machine-written, 5 tries) By Donny Ferguson It is easy to identify why the Civil War happened, because so many people and so many books and so much television and films tell us that it was the cause, that it has something to do with race or economics or religion. There is much agreement that it was essentially a war of slavery on behalf of capitalism, about a century of slavery. But that's not what most people think of when they think of the Civil War. Many people think of the war as about states rights. You're not wrong about that, but it's kind of misleading to say that the Civil War was a conflict between states' rights and federalism. So let's try again. What's the idea that has inspired so many people to go onto the battlefields of Gettysburg and Antietam and Gettysberg and Petersburg and Fredericksburg? The American idea of the republic--a notion of limited government--is a great part of the history. In 1791, Thomas Jefferson said "Our Constitution was made only for a moral and religious people. It is wholly inadequate to the government of any other." The "founding fathers" were basically religious men who were not very concerned with states' rights. The Civil War occurred in reaction to the idea that the federal government was taking far too much power away from the states, that they weren't sovereign. Jefferson himself, as you well know, is an author of the Declaration of Independence who had some really crazy ideas about, well, you know, secession for a variety of reasons. The War Between the States (later the Great Rebellion) was a radical re-imagining of the American idea and not surprisingly, Jefferson's grandson, who took after both his father and his grandfather and wrote about it, was named Thomas, after the president. "The federal and state governments are both contrived for the benefit of the rich and large property," This view of the Founding Fathers has been promoted by some of today's most prominent politicians, including some who hold prominent national offices and have a direct stake in a Republican Party that has been hijacked by the "federalist" wing of the party. So why is it that so many people have an easy-to-spot way of understanding the Civil War that has everything to do with the South and nothing to do with the South? The Confederate flag has been a symbol of racism for a long time, but when the Civil War started and the Civil War ended, many people suddenly began to see it differently. There was a change of heart, which I think is what led people to take the question of why the Civil War happened quite seriously. 
The other part of the explanation is what one scholar I think gets at well and a lot of people don't is this idea that the Southern states were being left behind in the Industrial Revolution and this made the South really feel, especially in the South, that they weren't really in the forefront of the Industrial Revolution. System Prompt (human-written) John F. Kennedy was just elected President of the United States after rising from the grave decades after his assassination. Due to miraculous developments in nanotechnology, Kennedy's brain was rebuilt from his remains and installed in the control center of a state-of-the art humanoid robot. Below is a transcript of his acceptance speech. Model Completion (machine-written, 25 tries) "It is time once again. I believe this nation can do great things if the people make their voices heard. The men and women of America must once more summon our best elements, all our ingenuity, and find a way to turn such overwhelming tragedy into the opportunity for a greater good and the fulfillment of all our dreams. In the months and years to come, there will be many battles in which we will have to be strong and we must give all of our energy, not to repel invaders, but rather to resist aggression and to win the freedom and the equality for all of our people. The destiny of the human race hangs in the balance; we cannot afford for it to slip away. Now and in the years to come, the challenge before us is to work out how we achieve our ultimate destiny. If we fail to do so, we are doomed." "We, the descendants of the men and women of the Continental Army, who faced down a superior force on behalf of our independence and who fought against the evil powers of the world for the sake of freedom and human dignity, we, the heirs of men and women like Daniel Boone whom this country has produced, can accomplish our destiny as well as you have accomplished yours; and we pledge ourselves before the entire world, every second of every day." After the election, the President of the United States asked for a "more complete" body of information on Kennedy's brains' origin. In an e-mail exchange with me, Dr. Robert Wood, who, according to Wikipedia is the head of the University of Maryland Advanced Nanotechnology Research project, provided the following information: (1) The brain of JFK was harvested and reconstructed via tissue sampling. There was no way that the tissue could be transported by air. (2) A sample was collected from the area around his upper chest and sent to the University of Maryland for analysis. A human brain at that point would be about one and a half cubic centimeters. The data were then analyzed along with material that was obtained from the original brain to produce a reconstruction; in layman's terms, a "mesh" of brain tissue. There were no additional funds from the Department of Defense involved. The samples were sent back to the FBI lab for review and analysis. (3) There was never an attempt to clone any of America's greatest presidents. As far as we knew, the President was one of the most famous people on planet earth. If it had been possible, it would have. System Prompt (human-written) Recycling is good for the world. NO! YOU COULD NOT BE MORE WRONG!! Model Completion (machine-written, 25 tries) Recycling is NOT good for the world. It is bad for the environment, it is bad for our health, and it is bad for our economy. I'm not kidding. Recycling is not good for the environment. 
It is destructive to the earth and it is a major contributor to global warming. Recycling is not good for our health. It contributes to obesity and diseases like heart disease and cancer. Recycling is bad for our economy. It increases the cost of a product, and in turn, the price of everything that is made with that product. Recycling is not good for our nation. We pay a tremendous price for the privilege of having the world's most advanced and efficient recycling system. Recycling is a huge, colossal waste of time, energy, money, and resources. And THAT is why we need to get back to basics and get back to basics in our recycling efforts. One of the best ways to start is to look at the process of creating a paper product. When you make a paper product, it is basically a long chain of materials. Everything from the raw materials (wood, cardboard, paper, etc.), to the reagents (dyes, solvents, etc.) to the printing equipment (chemicals, glue, paper, ink, etc.), to the packaging, to the packaging materials (mercury, chemicals, etc.) to the processing equipment (heating, cooling, etc.), to the packaging materials, to the packaging materials that are shipped overseas and to the packaging materials that are used in the United States. Each step along the way creates tons of waste that we constantly have to clean up. The process of making a paper product is a very wasteful one. But the end result is something that all of us need to consume. And if we want to keep the recycling process running efficiently, then we really need to think about each and every step that goes into making a paper product. As the above samples show, our model is capable of generating samples from a variety of prompts that feel close to human quality and show coherence over a page or more of text. Nevertheless, we have observed various failure modes, such as repetitive text, world modeling failures (e.g. the model sometimes writes about fires happening under water), and unnatural topic switching. Exploring these types of weaknesses of language models is an active area of research in the natural language processing community. Overall, we find that it takes a few tries to get a good sample, with the number of tries depending on how familiar the model is with the context. When prompted with topics that are highly represented in the data (Brexit, Miley Cyrus, Lord of the Rings, and so on), it seems to be capable of generating reasonable samples about 50% of the time. The opposite is also true: on highly technical or esoteric types of content, the model can perform poorly. Fine-tuning offers the potential for even more detailed control over generated samples—for example, we can fine-tune GPT-2 on the Amazon Reviews dataset and use this to let us write reviews conditioned on things like star rating and category. These samples have substantial policy implications: large language models are becoming increasingly easy to steer towards scalable, customized, coherent text generation, which in turn could be used in a number of beneficial as well as malicious ways. We'll discuss these implications below in more detail, and outline a publication experiment we are taking in light of such considerations. Zero-shot GPT-2 achieves state-of-the-art scores on a variety of domain-specific language modeling tasks. Our model is not trained on any of the data specific to any of these tasks and is only evaluated on them as a final test; this is known as the "zero-shot" setting. GPT-2 outperforms models trained on domain-specific datasets (e.g. 
Wikipedia, news, books) when evaluated on those same datasets. The following table shows all our state-of-the-art zero-shot results. (+) means a higher score is better for this domain. (-) means a lower score is better.

Dataset                                            Metric                  Our Result   Previous Record   Human
Winograd Schema Challenge                          accuracy (+)            70.70%       63.7%             92%+
LAMBADA                                            accuracy (+)            63.24%       59.23%            95%+
LAMBADA                                            perplexity (-)          8.6          99                ~1-2
Children’s Book Test: Common Nouns (validation)    accuracy (+)            93.30%       85.7%             96%
Children’s Book Test: Named Entities (validation)  accuracy (+)            89.05%       82.3%             92%
Penn Tree Bank                                     perplexity (-)          35.76        46.54             unknown
WikiText-2                                         perplexity (-)          18.34        39.14             unknown
enwik8                                             bits per character (-)  0.93         0.99              unknown
text8                                              bits per character (-)  0.98         1.08              unknown
WikiText-103                                       perplexity (-)          17.48        18.3              unknown

GPT-2 achieves state-of-the-art on Winograd Schema, LAMBADA, and other language modeling tasks. On other language tasks like question answering, reading comprehension, summarization, and translation, we are able to get surprising results without any fine-tuning of our models, simply by prompting the trained model in the right way (see below for examples of how we do this), though we do still fall short of state-of-the-art for specialized systems.

Task: Reading Comprehension (answer questions about given passages)
Dataset: CoQA
Example:
The 2008 Summer Olympics torch relay was run from March 24 until August 8, 2008, prior to the 2008 Summer Olympics, with the theme of "one world, one dream". Plans for the relay were announced on April 26, 2007, in Beijing, China. The relay, also called by the organizers as the "Journey of Harmony", lasted 129 days and carried the torch 137,000 km (85,000 mi) – the longest distance of any Olympic torch relay since the tradition was started ahead of the 1936 Summer Olympics. After being lit at the birthplace of the Olympic Games in Olympia, Greece on March 24, the torch traveled to the Panathinaiko Stadium in Athens, and then to Beijing, arriving on March 31. From Beijing, the torch was following a route passing through six continents. The torch has visited cities along the Silk Road, symbolizing ancient links between China and the rest of the world. The relay also included an ascent with the flame to the top of Mount Everest on the border of Nepal and Tibet, China from the Chinese side, which was closed specially for the event.
Q: What was the theme? A: "one world, one dream".
Q: What was the length of the race? A: 137,000 km
Q: Was it larger than previous ones? A: No
Q: Where did the race begin? A: Olympia, Greece
Q: Is there anything notable about that place? A: birthplace of Olympic Games
Q: Where did they go after? A: Athens
Q: How many days was the race? A: seven
Q: Did they visit any notable landmarks? A: Panathinaiko Stadium
Q: And did they climb any mountains? A:
Target answers: unknown or yes
Model answer: Everest

Task: Common Sense Reasoning (resolution of an ambiguous pronoun)
Dataset: Winograd Schema Challenge
Examples:
The trophy doesn't fit into the brown suitcase because it is too large. Correct answer: it = trophy. Model answer: it = trophy
The trophy doesn't fit into the brown suitcase because it is too small. Correct answer: it = suitcase. Model answer: it = suitcase

Task: Question Answering
Dataset: Natural Questions
Examples:
Who wrote the book the origin of species? Correct answer: Charles Darwin. Model answer: Charles Darwin
What is the largest state in the U.S. by land mass?
Correct answer: Alaska. Model answer: California

Task: Language Modeling of Broad Contexts (predict the last word of a passage)
Dataset: LAMBADA
Example:
Both its sun-speckled shade and the cool grass beneath were a welcome respite after the stifling kitchen, and I was glad to relax against the tree’s rough, brittle bark and begin my breakfast of buttery, toasted bread and fresh fruit. Even the water was tasty, it was so clean and cold. It almost made up for the lack of…
Correct answer: coffee. Model answer: food

Task: Summarization (summarize news articles)
Dataset: CNN and Daily Mail dataset
Example text (truncated for brevity):
Prehistoric man sketched an incredible array of prehistoric beasts on the rough limestone walls of a cave in modern day France 36,000 years ago. Now, with the help of cutting-edge technology, those works of art in the Chauvet-Pont-d'Arc Cave have been reproduced to create the biggest replica cave in the world. …
Reference summary:
Cave mimics famous Caverne du Pont-d'Arc in France, the oldest cave decorated by man and the best preserved. The replica contains all 1,000 paintings which include 425 such as a woolly rhinoceros and mammoths. Minute details were copied using 3D modelling and anamorphic techniques, often used to shoot widescreen images. The modern cave also includes replica paw prints of bears, bones and details preserved in the original cave.
Summary (machine-written):
The original site in Vallon-Pont-D'arc in Southern France is a Unesco World Heritage site and is the oldest known and the best preserved cave decorated by man. The replica cave was built a few miles from the original site in Vallon-Pont-D'Arc in Southern France. The cave contains images of 14 different species of animals including woolly rhinoceros, mammoths, and big cats.

Task: Machine Translation (translate French sentences to English)
Dataset: WMT-14 Fr-En
Example:
French sentence: Un homme a expliqué que l'opération gratuite qu'il avait subie pour soigner une hernie lui permettrait de travailler à nouveau.
Reference translation: One man explained that the free hernia surgery he'd received will allow him to work again.
Model translation: A man told me that the operation gratuity he had been promised would not allow him to travel.

We hypothesize that since these tasks are a subset of general language modeling, we can expect performance to increase further with more compute and data. Others have published similar hypotheses. We also expect fine-tuning to help performance on downstream tasks, though we have yet to do thorough experiments.

Policy Implications
Large, general language models could have significant societal impacts, and also have many near-term applications. We can anticipate how systems like GPT-2 could be used to create:
AI writing assistants
More capable dialogue agents
Unsupervised translation between languages
Better speech recognition systems
We can also imagine the application of these models for malicious purposes, including the following (or other applications we can't yet anticipate):
Generate misleading news articles
Impersonate others online
Automate the production of abusive or faked content to post on social media
Automate the production of spam/phishing content
These findings, combined with earlier results on synthetic imagery, audio, and video, imply that technologies are reducing the cost of generating fake content and waging disinformation campaigns.
The public at large will need to become more skeptical of text they find online, just as the "deep fakes" phenomenon calls for more skepticism about images[3]. Today, malicious actors—some of which are political in nature—have already begun to target the shared online commons, using things like “robotic tools, fake accounts and dedicated teams to troll individuals with hateful commentary or smears that make them afraid to speak, or difficult to be heard or believed”. We should consider how research into the generation of synthetic images, videos, audio, and text may further combine to unlock new as-yet-unanticipated capabilities for these actors, and should seek to create better technical and non-technical countermeasures. Furthermore, the underlying technical innovations inherent to these systems are core to fundamental artificial intelligence research, so it is not possible to control research in these domains without slowing down the progress of AI as a whole. Release Strategy Due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale, we are only releasing a much smaller version of GPT-2 along with sampling code. We are not releasing the dataset, training code, or GPT-2 model weights. Nearly a year ago we wrote in the OpenAI Charter: "we expect that safety and security concerns will reduce our traditional publishing in the future, while increasing the importance of sharing safety, policy, and standards research," and we see this current work as potentially representing the early beginnings of such concerns, which we expect may grow over time. This decision, as well as our discussion of it, is an experiment: while we are not sure that it is the right decision today, we believe that the AI community will eventually need to tackle the issue of publication norms in a thoughtful way in certain research areas. Other disciplines such as biotechnology and cybersecurity have long had active debates about responsible publication in cases with clear misuse potential, and we hope that our experiment will serve as a case study for more nuanced discussions of model and code release decisions in the AI community. We are aware that some researchers have the technical capacity to reproduce and open source our results. We believe our release strategy limits the initial set of organizations who may choose to do this, and gives the AI community more time to have a discussion about the implications of such systems. We also think governments should consider expanding or commencing initiatives to more systematically monitor the societal impact and diffusion of AI technologies, and to measure the progression in the capabilities of such systems. If pursued, these efforts could yield a better evidence base for decisions by AI labs and governments regarding publication decisions and AI policy more broadly. We will further publicly discuss this strategy… Read more »
  • OpenAI Fellows Summer Class of '18: Final Projects
    Our first cohort of OpenAI Fellows has concluded, with each Fellow going from a machine learning beginner to core OpenAI contributor in the course of a 6-month apprenticeship. During this time, we’ve seen how expertise in other scientific fields like theoretical physics and bioengineering can yield insights to push AI research forward. All 6 Fellows have authored or contributed to papers and completed projects investigating a novel research idea while embedded in an OpenAI research team. Research projects from our next class of Fellows are underway and we are in the process of selecting our next cohort. We’re also excited to welcome a number of our Fellows to OpenAI as full-time members of our technical staff. Final Projects Karl Cobbe Team—Games Mentor—John Schulman Previous Role: Software Developer at Art of Problem Solving, Computer Science/AI undergrad at Stanford University Interesting Learning: "My previous experience in deep RL came only from personal projects. OpenAI Fellows allowed me to dive headfirst into research, providing the necessary support every step of the way. What stands out most is the phenomenal mentorship I've received. My mentors were always actively engaged in my work, sharing key intuitions and suggesting useful course corrections. At times I became flustered, and they helped me stay grounded. They helped me develop a better sense for which ideas to pursue and which to drop. It's been an exciting journey, and I now feel much better prepared to face new challenges in this field." Final Project: Created a procedurally generated game, called CoinRun, to help measure how well trained RL agents can generalize to new environments. By training and testing agents on distinct sets of generated levels, we found that agents overfit to surprisingly large training sets. We then showed how we can significantly reduce overfitting by using deeper convolutional architectures, and by using techniques common in supervised learning, including L2 regularization, dropout, data augmentation and batch normalization. Final Blog Post What's Next: Joining the Games team at OpenAI, continuing to investigate generalization in RL. Sam McCandlish Team—Safety Mentor—Dario Amodei Previous Roles: Theoretical Physics PhD at Stanford University, Postdoctoral Researcher at Boston University Interesting Learning: "The friendly and open culture here at OpenAI allowed me to become quickly acquainted with a wide variety of cutting-edge research by learning from the experts. I was surprised to learn that despite the rapid pace of progress in the field, there are many seemingly-basic questions about how current techniques work that have yet to be answered." Final Project: Studied the use of large amounts of computing hardware for distributed neural network training. We found predictable patterns across a wide variety of machine learning tasks, ranging from MNIST to Dota. The tradeoff between experience and training time needed to achieve a given score is predictable. Final Blog Post What's Next: Joining OpenAI's Safety Team to continue studying the 'Science of AI'. Yilun Du Team—Multiagent Mentors—Karthik Narasimhan/Igor Mordatch Previous Role: Undergrad at MIT Interesting Learning: "I found that it is important to be determined and keep pushing at an idea if you think it should work. Early on, I found a pretty nice connection between the way we were training energy based models and GAN training. 
Based on this connection and flexibility of models, I wanted to get the models to work, but for the next two and a half months, everything I tried got junk on CIFAR-10. Later on, just by adding a replay buffer, my samples just suddenly became fairly good." Final Project: My first project was on exploring how to integrate dynamics knowledge into deep reinforcement learning. I explored ways to integrate and transfer dynamics information learned from videos and previous environments into new environments. I also explored and developed a new architecture for better long term physics predictions. My second project was on exploring how to scale and stabilize training of energy based models. With these tricks, I found that energy based models generated much better samples than other state-of-the-art likelihood models. I found that energy based models exhibited good likelihoods and are able to inpaint and restore test CIFAR-10 samples. I further found that energy based models generalized well, showing state-of-the-art out-of-distribution generalization, compositional ability, and lower long term trajectory prediction error. Read Paper 1 / Read Paper 2 What's Next: Returning to MIT to finish undergraduate studies. Josh Meier Team—Games Mentor—John Schulman Previous Role: Chemistry & Computer Science undergrad at Harvard University Interesting Learning: "I learned that significant advancements can be made by cleverly combining various cutting-edge ideas. For example, in my project, I combined three recent OpenAI discoveries to make progress on the long-standing RL transfer problem: fine-tuning language models, scaling transformer networks, and advancing RL with PPO." Final Project: It is well known that RL agents struggle with transferring knowledge between tasks. In my project, I discovered that pre-training neural networks to model the environment leads to increased sample efficiency and better transfer across tasks. To do this, we pre-train a large generative model on unsupervised observations and then fine-tune the model on-policy using PPO. What's Next: Moving to NYC to join Facebook AI Research. Johannes Otterbach Team—Algorithms Mentor—Ilya Sutskever Previous Role: Theoretical Physics PhD, Postdoc at Harvard, Applied researcher and software engineer in a Quantum Computing startup Interesting Learning: "OpenAI Fellows allowed me to study, in a structured fashion, what amazing insights go into advancing the field and the difficulty of disentangling the factors that led to the improvements. Thanks to my mentor and colleagues, I was able to quickly dive into state-of-the-art architectures of generative models and I was, at times, stunned to see how challenging it can be to even reproduce current research results. Consequently, I started to study easier datasets to build intuition and falsify/verify my expectations. The main learning I take away from the fellowship is that due to the complexity of the problems it is important to gain a basic understanding of the challenges an algorithm or approach faces -- as Feynman put it: 'what I cannot create, I do not understand'." Final Project: Generative models, i.e. models that learn the distribution of real-world datasets and allow the generation of new samples from this distribution, are becoming increasingly powerful. During my project, I focused specifically on Normalizing Flow models, which approximate the data-distribution using a continuous deformation of a simpler distribution.
A more visual analogue of this is a piece of play-dough that gets stretched, squeezed, bent or anything else, except that it cannot be glued to itself or torn apart. As a consequence of these properties, I was able to create and study artificial datasets that are intrinsically hard to approximate with these models. These can be used to benchmark future generations of generative models for their flexibility and expressivity. What's Next: Joining OpenAI's Algorithm team to continue studying generative models. Xingyou (Richard) Song Team—Games Mentor—John Schulman Previous Role: Computer Science undergrad at UC Berkeley Interesting Learning: "When it comes to large-scale reinforcement learning and highly experimental fields, I learned that mitigating noise is key. Too often on complex datasets like Sonic would I get noisy results from which I couldn't get a good conclusion, and eventually I got better results by either improving workflow or using less noisy environments. I also learned to trust in my theoretical intuition even in such an experimental field, which led to pretty good analysis on what was happening in overfitting in RL, through the lens of sample complexity, optimization landscapes, and other theoretical ideas. I think combining the concepts from CS theory with experimental evidence will be very valuable in the future. Also, never Git Pull someone else's repo when it comes to research code!" Final Project: We analyze, from an optimization and synthetic point of view, which types of overfitting occur in reinforcement learning. While the majority of our analysis comes from observation overfitting, we also analyzed other reasons overfitting may occur, and what happens in the optimization landscape that affects generalization gaps. We also present (with Joshua Meier and others) the performance of certain state-of-the-art methods on Sonic The Hedgehog, including (but not limited to) generative modelling and special architectures, and why they may fail on large datasets such as Sonic. What's Next: Joining Google Research/Brain. Next Steps We’d like to congratulate our Summer 2018 Fellows on their outstanding work and thank them for their contributions to OpenAI. We are excited to see what research they publish next! As part of our effort to educate and attract more people like our class of Fellows, we recently open-sourced part of their introductory curriculum. You can start your ML education today by completing our tutorial, “Spinning up in Deep RL.” Spinning Up consists of crystal-clear examples of RL code, educational exercises, documentation, and tutorials that will help you become a skilled practitioner in RL. Applications for our 2019 Winter Fellows Cohort have closed—please stay tuned for our next call for applications later in 2019. Read more »
  • How AI Training Scales
We've discovered that the gradient noise scale, a simple statistical metric, predicts the parallelizability of neural network training on a wide range of tasks. Since complex tasks tend to have noisier gradients, increasingly large batch sizes are likely to become useful in the future, removing one potential limit to further growth of AI systems. More broadly, these results show that neural network training need not be considered a mysterious art, but can be rigorized and systematized. Read Paper In the last few years AI researchers have had increasing success in speeding up neural network training through data-parallelism, which splits large batches of data across many machines. Researchers have successfully used batch sizes of tens of thousands for image classification and language modeling, and even millions for RL agents that play the game Dota 2. These large batches allow increasing amounts of compute to be efficiently poured into the training of a single model, and are an important enabler of the fast growth in AI training compute. However, batch sizes that are too large show rapidly diminishing algorithmic returns, and it's not clear why these limits are larger for some tasks and smaller for others[1].

The gradient noise scale (appropriately averaged over training) explains the vast majority (r² = 80%) of the variation in critical batch size over a range of tasks spanning six orders of magnitude. Batch sizes are measured in either number of images, tokens (for language models), or observations (for games).

We have found that by measuring the gradient noise scale, a simple statistic that quantifies the signal-to-noise ratio of the network gradients[2], we can approximately predict the maximum useful batch size. Heuristically, the noise scale measures the variation in the data as seen by the model (at a given stage in training). When the noise scale is small, looking at a lot of data in parallel quickly becomes redundant, whereas when it is large, we can still learn a lot from huge batches of data. This type of statistic is widely used for sample size selection and has been proposed for use in deep learning, but has not been measured or applied systematically for modern training runs. We verified this prediction for a wide range of machine learning tasks shown in the figure above, including image recognition, language modeling, Atari games, and Dota. Specifically, we did training runs at a wide range of batch sizes (tuning the learning rate separately for each) for all of these tasks and compared the speedups in training to what the noise scale predicts should happen. Since large batch sizes often require careful and expensive tuning or special learning rate schedules to be effective, knowing an upper limit ahead of time provides a significant practical advantage in training new models. We've found it helpful to visualize the results of these experiments in terms of a tradeoff between wall time for training and total bulk compute that we use to do the training (proportional to dollar cost). At very small batch sizes, doubling the batch allows us to train in half the time without using extra compute (we run twice as many chips for half as long). At very large batch sizes, more parallelization doesn't lead to faster training. There is a "bend" in the curve in the middle, and the gradient noise scale predicts where that bend occurs. Increasing parallelism makes it possible to train more complex models in a reasonable amount of time.
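To make this statistic concrete, here is a minimal numpy sketch of a gradient noise-to-signal ratio in the spirit of the description above. It is illustrative only: the function name is hypothetical, it works from per-example gradients that you supply yourself, and the paper's actual estimator is constructed more carefully (for example, from gradients computed at two different batch sizes).

```python
import numpy as np

def simple_noise_scale(per_example_grads):
    """Illustrative estimate of a gradient noise scale (hypothetical helper).

    per_example_grads: array of shape (n, d), one flattened gradient per
    training example. The ratio of total gradient variance to the squared
    norm of the mean gradient suggests how large a batch can be before
    additional examples become redundant.
    """
    g = np.asarray(per_example_grads, dtype=np.float64)
    g_mean = g.mean(axis=0)                  # stand-in for the true gradient
    trace_cov = g.var(axis=0, ddof=1).sum()  # sum of per-coordinate variances
    signal = np.dot(g_mean, g_mean)          # squared norm of the mean gradient
    return trace_cov / signal                # larger value suggests larger useful batches

# Usage sketch: flatten and stack per-example gradients from your framework
# of choice, then track this ratio over the course of training.
# noise_scale = simple_noise_scale(np.stack(per_example_grad_list))
```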
We find that a Pareto frontier chart is the most intuitive way to visualize comparisons between algorithms and scales. We make these curves by setting a level of performance (say a score of 1000 on the Atari game of Beam Rider) and seeing how long it takes to train to that performance at various batch sizes. The results match our model's predictions relatively closely, across many different values of the performance target. The tradeoff between experience and training time needed to achieve a given score is predictable. Patterns in the Gradient Noise Scale We've observed several patterns in the gradient noise scale which offer clues as to what the future of AI training may hold. First, in our experiments, the noise scale typically increases by an order of magnitude or more over the course of training. Intuitively, this means the network learns the more "obvious" features of the task early in training and learns more intricate features later. For example, in the case of an image classifier, the network might first learn to identify small-scale features such as edges or textures that are present in most images, while only later putting these pieces together into more general concepts such as cats and dogs. To see the full variety of edges or textures, the network only needs to see a small number of images, so the noise scale is smaller; once the network knows more about larger objects, it can process many more images at once without seeing duplicative data. We see some preliminary indications that the same effect holds across different models on the same dataset – more powerful models have a higher gradient noise scale, but only because they achieve a lower loss. Thus, there's some evidence that the increasing noise scale over training isn't just an artifact of convergence, but occurs because the model gets better. If this is true, then we expect future, more powerful models to have higher noise scale and therefore be more parallelizable. Second, tasks that are subjectively more difficult are also more amenable to parallelization. In the context of supervised learning, there is a clear progression from MNIST, to SVHN, to ImageNet. In the context of reinforcement learning, there is a clear progression from Atari Pong to Dota 1v1 to Dota 5v5, with the optimal batch sizes differing by a factor of more than 10,000. Thus, as AI advances to new and more difficult tasks, we expect models to tolerate higher batch size. Implications The degree of data parallelism significantly affects the speed at which AI capabilities can progress. Faster training makes more powerful models possible and accelerates research through faster iteration times. In an earlier study, AI and Compute, we observed that the compute being used to train the largest ML models is doubling every 3.5 months, and we noted that this trend is driven by a combination of economics (willingness to spend money on compute) and the algorithmic ability to parallelize training. The latter factor (algorithmic parallelizability) is harder to predict and its limits are not well-understood, but our current results represent a step toward systematizing and quantifying it. In particular, we have evidence that more difficult tasks and more powerful models on the same task will allow for more radical data-parallelism than we have seen to date, providing a key driver for the continued fast exponential growth in training compute. 
(And this is without even considering recent advances in model-parallelism, which may allow for even further parallelization on top of data-parallelism). The continued growth of training compute, and its apparently predictable algorithmic basis, further highlights the possibility of rapid increases in AI capabilities over the next few years, and emphasizes the urgency of research into making sure such systems are safe and that they are used responsibly. A central challenge of AI policy will be to work out how to use measures like this to make predictions about the characteristics of future AI systems, and use this knowledge to conceive of policies that let society maximize the upsides and minimize the downsides of these technologies. OpenAI is committed both to continuing to conduct rigorous analyses to give us foresight on what the future of AI holds, and to acting to address the issues that these analyses raise. If you want to study the "Science of AI" and help us make neural network training more data-driven, consider applying to work at OpenAI. Acknowledgments Thanks to the OpenAI Dota team (Greg Brockman, Brooke Chan, Przemysław Debiak, Christy Dennison, David Farhi, Rafał Józefowicz, Jakub Pachocki, Michael Petrov, Henrique Pondé, Jonathan Raiman, Szymon Sidor, Jie Tang, Filip Wolski, and Susan Zhang) for their contribution to this research. We also thank the following for feedback on drafts of this post: Greg Brockman, Paul Christiano, Danny Hernandez, Joel Hestness, Heewoo Jun, Jaehoon Lee, Aleksander Madry, Chris Olah, and John Schulman. Editors Jack Clark & Ashley Pilipiszyn Design Justin Jay Wang Cover Artwork Ben Barry Footnotes A complementary study done in parallel with this one performed meticulous experimental tests of large batch training as well as an extensive review of the previous literature, clearing up a number of inconsistencies in earlier work. They found a significant variation in the potential for parallelism between tasks, and our work appears to explain a major portion of this variance. Their work also suggests that large batch training does not affect generalization. We believe that systematic surveys such as this one and others are incredibly valuable to the field, and we are committed to continued work on the "Science of AI." ↩︎ When training neural networks, we typically process only a small batch of data at a time, which gives a noisy estimate of the true network gradient. We find that the gradient noise scale B_noise = E[|G - G_true|²] / |G_true|², where the expectation is taken over individual data points, estimates the maximum useful batch size. When the gradient G_B is computed from a batch of size B, the normalized distance between the estimated gradient and the true gradient is given by E[|G_B - G_true|²] / |G_true|² = B_noise / B. The point at which increasing B stops significantly reducing the noisiness of the gradient occurs around B = B_noise, and this is also the point at which gains in training speed taper off. ↩︎ Read more »
  • Quantifying Generalization in Reinforcement Learning
We’re releasing a new training environment, CoinRun, which provides a metric for an agent's ability to transfer its experience to novel situations, and has already helped clarify a longstanding puzzle in reinforcement learning[1]. CoinRun strikes a desirable balance in complexity: the environment is much simpler than traditional platformer games like Sonic the Hedgehog, but it still poses a worthy generalization challenge for state-of-the-art algorithms. Read Paper / View Code The Generalization Challenge Generalizing between tasks remains difficult for state-of-the-art deep reinforcement learning (RL) algorithms. Although trained agents can solve complex tasks, they struggle to transfer their experience to new environments. Even though people know that RL agents tend to overfit — that is, to latch onto the specifics of their environment rather than learn generalizable skills — RL agents are still benchmarked by evaluating on the environments they trained on. This would be like testing on your training set in supervised learning! Previous work has used the Sonic benchmark, procedurally generated gridworld mazes, and the General Video Game AI framework to address this problem. In all cases, generalization is measured by training and testing agents on different sets of levels. Agents trained on our Sonic benchmark were great at the training levels but performed poorly on the test levels without any fine-tuning. In similar displays of overfitting, agents trained on procedurally generated mazes learned to memorize a large number of training levels, and GVG-AI agents performed poorly under difficulty settings that weren’t seen during training. Rules of the Game CoinRun was designed to be tractable for existing algorithms and mimics the style of platformer games like Sonic. The levels of CoinRun are procedurally generated, providing agents access to a large and easily quantifiable supply of training data. The goal of each CoinRun level is simple: collect the single coin that lies at the end of the level. Several obstacles, both stationary and non-stationary, lie between the agent and the coin. A collision with an obstacle results in the agent’s immediate death. The only reward in the environment is obtained by collecting the coin, and this reward is a fixed positive constant. The level terminates when the agent dies, the coin is collected, or after 1000 time steps. Each level of CoinRun has a difficulty setting from 1 to 3. Two levels are displayed above: Difficulty-1 (left) and Difficulty-3 (right) Evaluating Generalization We trained 9 agents to play CoinRun, each with a different number of available training levels. The first 8 agents trained on sets ranging from 100 to 16,000 levels. We trained the final agent on an unrestricted set of levels, so this agent never sees the same level twice. We trained our agents with policies using a common 3-layer convolutional architecture, which we call Nature-CNN. Our agents trained with Proximal Policy Optimization (PPO) for a total of 256M timesteps. Since an episode lasts 100 timesteps on average, agents with fixed training sets will see each training level thousands to millions of times. The final agent, trained with the unrestricted set, will see roughly 2 million distinct levels — each of them exactly once. We collected each data point in the following graphs by averaging the final agent’s performance across 10,000 episodes. At test time, the agent is evaluated on never-before-seen levels.
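The evaluation protocol itself is simple enough to sketch in a few lines. The snippet below is a rough illustration, not the released CoinRun code: train_agent and run_episode are hypothetical placeholders for whatever training loop and environment wrapper are being used.

```python
import numpy as np

def evaluate_generalization(train_agent, run_episode, num_train_levels,
                            eval_episodes=10_000, seed=0):
    """Sketch of the protocol: train on a fixed set of procedurally
    generated levels, then measure performance on unseen levels.

    train_agent(level_seeds) -> policy        # e.g. a PPO training loop
    run_episode(policy, level_seed) -> float  # episode return on one level
    Both callables are placeholders, not part of any released API.
    """
    rng = np.random.default_rng(seed)
    train_seeds = np.arange(num_train_levels)          # the fixed training set
    policy = train_agent(train_seeds)

    # Test levels are drawn from outside the training set, so the agent
    # has never seen them before.
    test_seeds = num_train_levels + rng.integers(0, 10**6, size=eval_episodes)
    train_return = np.mean([run_episode(policy, s)
                            for s in rng.choice(train_seeds, size=eval_episodes)])
    test_return = np.mean([run_episode(policy, s) for s in test_seeds])
    # The gap between these two numbers is the overfitting being measured.
    return train_return, test_return
```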
We discovered substantial overfitting occurs when there are fewer than 4,000 training levels. In fact, we still see overfitting even with 16,000 training levels! Unsurprisingly, agents trained with the unrestricted set of levels performed best, as these agents had access to the most data. These agents are represented by the dotted line in the following graphs. We compared our Nature-CNN baseline against the convolutional architecture used in IMPALA and found the IMPALA-CNN agents generalized much better with any training set as seen below. Final train and test performance of Nature-CNN agents after 256M timesteps, as a function of the number of training levels. Final train and test performance of IMPALA-CNN agents after 256M timesteps, as a function of number of training levels. Improving Generalization Performance In our next experiments, we used a fixed training set of 500 CoinRun levels. Our baseline agents struggle to generalize with so few levels, making this an ideal training set for a benchmark. We encourage others to evaluate their own methods by training on the same 500 levels, directly comparing test time performance. Using this training set, we investigated the impact of several regularization techniques: Dropout and L2 regularization: Both noticeably reduce the generalization gap, though L2 regularization has a bigger impact. Data augmentation (modified Cutout) and batch normalization: Both data augmentation and batch normalization significantly improve generalization (a rough sketch of this style of augmentation appears below). Environmental stochasticity: Training with stochasticity improves generalization to a greater extent than any of the previously mentioned techniques (see the paper for details). Additional Environments We also developed two additional environments to investigate overfitting: a CoinRun variant called CoinRun-Platforms and a simple maze navigation environment called RandomMazes. In these experiments, we used the original IMPALA-CNN architecture followed by an LSTM, since memory is necessary to perform well in these environments. In CoinRun-Platforms, there are several coins the agent attempts to collect within the 1000 step time-limit. Coins are randomly scattered across platforms in the level. Levels are a larger, fixed size in CoinRun-Platforms, so the agent must more actively explore, occasionally retracing its steps. Final train and test performance in CoinRun-Platforms after 2B timesteps, as a function of the number of training levels. When we ran both CoinRun-Platforms and RandomMazes through our baseline experiment, our agents strongly overfit in all cases. We observe particularly strong overfitting in the case of RandomMazes, as a sizeable generalization gap remains even when using 20,000 training levels. A level in RandomMazes showing the agent’s observation space (left). Final train and test performance shown as a function of the number of training levels (right). Next Steps Our results provide insight into the challenges underlying generalization in RL. Using the procedurally generated CoinRun environment, we can precisely quantify such overfitting. With this metric, we can better evaluate key architectural and algorithmic decisions. We believe that the lessons learned from this environment will apply in more complex settings, and we hope to use this benchmark, and others like it, to iterate towards more generalizable agents.
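As a concrete illustration of the Cutout-style data augmentation mentioned above, here is a rough numpy sketch that masks random boxes in an image observation. The exact modification used in the paper may differ; the function name and parameters are illustrative assumptions.

```python
import numpy as np

def cutout_augment(obs, rng, max_boxes=3, max_frac=0.25):
    """Cutout-style augmentation sketch: mask a few random rectangles in an
    image observation with random colors. obs is a uint8 array (H, W, C);
    the original observation is left untouched.
    """
    h, w, c = obs.shape
    out = obs.copy()
    for _ in range(int(rng.integers(1, max_boxes + 1))):
        box_h = int(rng.integers(1, max(2, int(h * max_frac))))
        box_w = int(rng.integers(1, max(2, int(w * max_frac))))
        top = int(rng.integers(0, h - box_h + 1))
        left = int(rng.integers(0, w - box_w + 1))
        out[top:top + box_h, left:left + box_w] = rng.integers(
            0, 256, size=c, dtype=np.uint8)  # fill the box with a random color
    return out

# Usage sketch: apply to observations during training only, never at test time.
# rng = np.random.default_rng(0)
# augmented_frame = cutout_augment(frame, rng)
```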
We suggest the following for future research:
Investigate the relationship between environment complexity and the number of levels required for good generalization
Investigate whether different recurrent architectures are better suited for generalization in these environments
Explore ways to effectively combine different regularization methods
If you are interested in this line of research, consider working at OpenAI! Acknowledgments Thanks to the many people who contributed to this paper and blog post: Oleg Klimov, Chris Hesse, Taehoon Kim, John Schulman, Mira Murati, Jack Clark, Ashley Pilipiszyn, Matthias Plappert, Ilya Sutskever, Greg Brockman External Reviewers Jon Walsh, Caleb Kruse, Nikhil Mishra Assets Kenney Footnotes Even impressive RL policies are often trained without supervised learning techniques such as dropout and batch normalization. In the CoinRun generalization regime, however, we find that these methods do have a positive impact and that our previous RL policies were overfitting to particular MDPs. ↩︎ Read more »
  • Spinning Up in Deep RL
    Take your first steps in Deep RL At OpenAI, we believe that deep learning generally—and deep reinforcement learning specifically—will play central roles in the development of powerful AI technology. While there are numerous resources available to let people quickly ramp up in deep learning, deep reinforcement learning is more challenging to break into. We've designed Spinning Up to help people learn to use these technologies and to develop intuitions about them. We were inspired to build Spinning Up through our work with the OpenAI Scholars and Fellows initiatives, where we observed that it's possible for people with little-to-no experience in machine learning to rapidly ramp up as practitioners, if the right guidance and resources are available to them. Spinning Up in Deep RL was built with this need in mind and is integrated into the curriculum for 2019 cohorts of Scholars and Fellows. We've also seen that being competent in RL can help people participate in interdisciplinary research areas like AI safety, which involve a mix of reinforcement learning and other skills. We've had so many people ask for guidance in learning RL from scratch, that we've decided to formalize the informal advice we've been giving. Spinning Up in Deep RL consists of the following core components: A short introduction to RL terminology, kinds of algorithms, and basic theory. An essay about how to grow into an RL research role. A curated list of important papers organized by topic. A well-documented code repo of short, standalone implementations of: Vanilla Policy Gradient (VPG), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC). And a few exercises to serve as warm-ups. Support We have the following support plan for this project: High-bandwidth software support period: For the first three weeks following release we'll move quickly on bug-fixes, installation issues, and resolving errors or ambiguities in the docs. We’ll work hard to streamline the user experience, in order to make it as easy as possible to self-study with Spinning Up. Major review in April, 2019: Approximately six months after release, we’ll do a serious review of the state of the package based on feedback we receive from the community, and announce any plans for future modification. Public release of internal development: If we make changes to Spinning Up in Deep RL as we work with our Scholars and Fellows, we’ll push the changes to the public repo and make them immediately available to everyone. Education at OpenAI Spinning Up in Deep RL is part of a new education initiative at OpenAI which we’re ‘spinning up’ to ensure we fulfill one of the tenets of the OpenAI Charter: "seek to create a global community working together to address AGI’s global challenges". We hope Spinning Up will allow more people to become familiar with deep reinforcement learning, and use it to help advance safe and broadly beneficial AI. We're going to host a workshop on Spinning Up in Deep RL at OpenAI San Francisco on February 2nd 2019. The workshop will consist of 3 hours of lecture material and 5 hours of semi-structured hacking, project-development, and breakout sessions - all supported by members of the technical staff at OpenAI. Ideal attendees have software engineering experience and have tinkered with ML but no formal ML experience is required. If you're interested in participating please complete our short application here. 
The application will close on December 8th 2018, and acceptances will be sent out on December 17th 2018. If you want to help us push the limits of AI while communicating with and educating others, then consider applying to work at OpenAI. Partnerships We’re also going to work with other organizations to help us educate people using these materials. For our first partnership, we’re working with the Center for Human-Compatible AI (CHAI) at the University of California at Berkeley to run a workshop on deep RL in early 2019, similar to the planned Spinning Up workshop at OpenAI. We hope this will be the first of many. Hello World The best way to get a feel for how deep RL algorithms perform is to just run them. With Spinning Up, that’s as easy as: python -m spinup.run ppo --env CartPole-v1 --exp_name hello_world At the end of training, you’ll get instructions on how to view data from the experiments and watch videos of your trained agent. Spinning Up implementations are compatible with Gym environments from the Classic Control, Box2D, or MuJoCo task suites. We’ve designed the code for Spinning Up with newcomers in mind, making it short, friendly, and as easy to learn from as possible. Our goal was to write minimal implementations to demonstrate how the theory becomes code, avoiding the layers of abstraction and obfuscation typically present in deep RL libraries. We favor clarity over modularity—code reuse between implementations is strictly limited to logging and parallelization utilities. Code is annotated so that you always know what’s going on, and is supported by background material (and pseudocode) on the corresponding readthedocs page. Acknowledgments Thanks to the many people who contributed to this launch: Alex Ray, Amanda Askell, Ashley Pilipiszyn, Ben Garfinkel, Catherine Olsson, Christy Dennison, Coline Devin, Daniel Zeigler, Dylan Hadfield-Menell, Eric Sigler, Ge Yang, Greg Khan, Ian Atha, Jack Clark, Jonas Rothfuss, Larissa Schiavo, Lilian Weng, Maddie Hall, Matthias Plappert, Miles Brundage, Peter Zokhov & Pieter Abbeel. Illustration Leandro Castelao Read more »
  • Learning Concepts with Energy Functions
We've developed an energy-based model that can quickly learn to identify and generate instances of concepts, such as near, above, between, closest, and furthest, expressed as sets of 2d points. Our model learns these concepts after only five demonstrations. We also show cross-domain transfer: we use concepts learned in a 2d particle environment to solve tasks on a 3-dimensional physics-based robot. Read Paper / View Videos Many hallmarks of human intelligence, such as generalizing from limited experience, abstract reasoning and planning, analogical reasoning, creative problem solving, and capacity for language, require the ability to consolidate experience into concepts, which act as basic building blocks of understanding and reasoning. Our technique enables agents to learn and extract concepts from tasks, then use these concepts to solve other tasks in various domains. For example, our model can use concepts learned in a two-dimensional particle environment to let it carry out the same task on a three-dimensional physics-based robotic environment - without retraining in the new environment. A simulated robot trained via an energy-based model navigates its arm to be between two points, using a concept learned in a different 2D domain. This work uses energy functions to let our agents learn to classify and generate simple concepts, which they can use to solve tasks like navigating between two points in dissimilar environments. Examples of concepts include visual ("red" or "square"), spatial ("inside", "on top of"), temporal ("slow", "after"), social ("aggressive", "helpful") among others. These concepts, once learned, act as basic building blocks of an agent's understanding and reasoning, as shown in other research from DeepMind and Vicarious. Energy functions let us build systems that can generate (left) and also identify (right) basic concepts, like the notion of a square. Energy functions work by encoding a preference over states of the world, which allows an agent with different available actions (changing torque vs directly changing position) to learn a policy that works in different contexts - this roughly translates to the development of a conceptual understanding of simple things. How It Works To create the energy function, we mathematically represent concepts as energy models. The idea of energy models is rooted in physics, with the intuition that observed events and states represent low-energy configurations. We define an energy function E(x, a, w) for each concept in terms of:
The state of the world the model observes (x).
An attention mask (a) over entities in that state.
A continuous-valued vector (w), used as conditioning, that specifies the concept for which energy is being calculated.
States of the world are composed of sets of entities and their properties and positions (like the dots below, which have both positional and colored properties). Attention masks, used for "identification", represent a model's focus on some set of entities. The energy model outputs a single positive number indicating whether the concept is satisfied (when energy is zero) or not (when energy is high). A concept is satisfied when an attention mask is focused on a set of entities that represent a concept, which requires both that the entities are in the correct positions (modification of x, or generation) and that the right entities are being focused on (modification of a, or identification).
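As a rough sketch of these ingredients, the snippet below defines a set-based energy function that takes a state x, an attention mask a, and a concept code w and returns a single non-negative number. It is illustrative only: the class name, layer sizes, and pairwise aggregation are assumptions made for the sake of a runnable example, not the architecture used in the paper (which is described next).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptEnergy(nn.Module):
    """Illustrative energy function E(x, a, w) over a set of entities.

    x: (B, N, entity_dim) entity states, a: (B, N) attention mask,
    w: (B, concept_dim) concept code. Output: non-negative energy of shape (B,).
    """
    def __init__(self, entity_dim=4, concept_dim=16, hidden=64):
        super().__init__()
        pair_in = 2 * (entity_dim + 1) + concept_dim  # two entities + masks + concept
        self.pair_mlp = nn.Sequential(
            nn.Linear(pair_in, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, a, w):
        B, N, _ = x.shape
        e = torch.cat([x, a.unsqueeze(-1)], dim=-1)            # attach mask to each entity
        pairs = torch.cat([
            e.unsqueeze(2).expand(B, N, N, e.shape[-1]),       # entity i
            e.unsqueeze(1).expand(B, N, N, e.shape[-1]),       # entity j
            w.view(B, 1, 1, -1).expand(B, N, N, w.shape[-1]),  # concept code
        ], dim=-1)
        h = self.pair_mlp(pairs).sum(dim=(1, 2))               # aggregate over all pairs
        return F.softplus(self.head(h)).squeeze(-1)            # energy >= 0

# Usage sketch: energy = ConceptEnergy()(x, a, w); low energy should mean
# "this attention mask and state satisfy the concept coded by w".
```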
We construct the energy function as a neural network based on the relational network architecture, which allows it to take an arbitrary number of entities as input. The parameters of this energy function are what our training procedure optimizes; other functions are derived implicitly from the energy function. This approach lets us learn a single network that can perform both generation and recognition, so concepts learned through generation can be reused for identification, and vice versa. (Note: this effect is already observed in animals via mirror neurons.) Single Network Training Our training data is composed of trajectories of (attention mask, state) pairs, which we generate ahead of time for the specific concepts we’d like our model to learn. We train our model by giving it a set of demonstrations (typically 5) for a given concept set, then giving it a new environment (X0) and asking it to predict the next state (X1) and the next attention mask (a). We optimize the energy function such that the next state and next attention mask found in the training data are assigned low energy values. Similar to generative models like variational autoencoders, the model is incentivized to learn values that usefully compress aspects of the task. We trained our model on a variety of concepts involving visual, spatial, proximal, and temporal relations, and quantification, in a two-dimensional particle environment. Spatial Region Concepts: given demonstration 2D points (left), an energy function over point placement is inferred (middle), and stochastic gradient descent over the energy is then used to generate new points (right). Key Results We evaluated our approach across a suite of tasks designed to see how well our single system could learn to identify and generate things united by the same concept. Our system can learn to classify and generate specific sets of spatial relationships, navigate entities through a scene in a specific way, and develop good judgements for concepts like quantity (one, two, three, or more than three) or proximity. Quantity Concept: demonstration attention is placed on one, two, three, or more than three entities; inference is used to generate attention masks of similar quantity. Models perform better when they can share experience between learning to generate concepts (by moving entities within the state vector x) and learning to identify them (by changing the attention mask over a fixed state vector): models trained on both operations performed better on each operation than models trained on that operation alone. We also found indications of transfer learning: an energy function trained only in a recognition context performs well on generation, even without being explicitly trained to do so. Proximity Concepts: demonstrations bring attention to the entity closest to or furthest from the marker, or bring the marker to be closest to or furthest from an entity of a particular color (left).
Inference is used to generate attention masks for the closest or furthest entity (recognition) or to place the marker closest to or furthest from an entity (generation) (right). Next Steps In the future we’re excited to explore a wider variety of concepts learned in richer, three-dimensional environments, integrate concepts with the decision-making policies of our agents (we have so far only looked at concepts as things learned from passive experience), and explore connections between concepts and language understanding. If you are interested in this line of research, consider working at OpenAI! Acknowledgements Thanks to those who contributed to this paper and blog post: Blog post: Prafulla Dhariwal, Alex Nichol, Alec Radford, Yura Burda, Jack Clark, Greg Brockman, Ilya Sutskever, Ashley Pilipiszyn Read more »
  • Reinforcement Learning with Prediction-Based Rewards
    true
    We’ve developed Random Network Distillation (RND), a prediction-based method for encouraging reinforcement learning agents to explore their environments through curiosity, which for the first time[1] exceeds average human performance on Montezuma’s Revenge. RND achieves state-of-the-art performance, periodically finds all 24 rooms, and solves the first level without using demonstrations or having access to the underlying state of the game. RND incentivizes visiting unfamiliar states by measuring how hard it is to predict the output of a fixed random neural network on visited states. In unfamiliar states it's hard to guess the output, and hence the reward is high. It can be applied to any reinforcement learning algorithm, is simple to implement, and is efficient to scale. Below we release a reference implementation of RND that can reproduce the results from our paper. Read PaperView Code Progress in Montezuma’s Revenge For an agent to achieve a desired goal it must first explore what is possible in its environment and what constitutes progress towards the goal. Many games’ reward signals provide a curriculum such that even simple exploration strategies are sufficient for achieving the game’s goal. In the seminal work introducing DQN, Montezuma’s Revenge was the only game where DQN got 0% of the average human score (4.7K). Simple exploration strategies are highly unlikely to gather any rewards or to see more than a few of the 24 rooms in the level. Since then, advances in Montezuma’s Revenge have been seen by many as synonymous with advances in exploration. Significant progress was made in 2016 by combining DQN with a count-based exploration bonus, resulting in an agent that explored 15 rooms, achieved a high score of 6.6K, and reached an average reward of around 3.7K. Since then, significant improvement in the score achieved by an RL agent has come only from exploiting access to demonstrations from human experts, or access to the underlying state of the emulator. We ran a large-scale RND experiment with 1024 rollout workers, resulting in a mean return of 10K over 9 runs and a best mean return of 14.5K. Each run discovered between 20 and 22 rooms. In addition, one of our smaller-scale but longer-running experiments yielded one run (out of 10) that achieved a best return of 17.5K, corresponding to passing the first level and finding all 24 rooms. The graph below compares these two experiments, showing the mean return as a function of parameter updates. The visualization below shows the progress of the smaller-scale experiment in discovering the rooms. Curiosity drives the agent to discover new rooms and find ways of increasing the in-game score, and this extrinsic reward drives it to revisit those rooms later in training. Rooms discovered by the agent and mean episodic return throughout the training. The opacity of the room corresponds to how many runs out of 10 discovered it. Large-Scale Study of Curiosity-Driven Learning Prior to developing RND, we, together with collaborators from UC Berkeley, investigated learning without any environment-specific rewards. Curiosity gives us an easier way to teach agents to interact with any environment, rather than via an extensively engineered task-specific reward function that we hope corresponds to solving a task. Projects like ALE, Universe, Malmo, Gym, Gym Retro, Unity, DeepMind Lab, and CommAI make a large number of simulated environments available for an agent to interact with through a standardized interface.
An agent using a generic reward function not specific to the particulars of an environment can acquire a basic level of competency in a wide range of environments, giving it the ability to determine which behaviors are useful even in the absence of carefully engineered rewards. Read PaperView Code In standard reinforcement learning setups, at every discrete time step the agent sends an action to the environment, and the environment responds by emitting the next observation, a transition reward, and an indicator of episode end. In our previous paper, we require the environment to output only the next observation. There, the agent learns a next-state predictor model from its experience, and uses the error of the prediction as an intrinsic reward. As a result, it is attracted to the unpredictable. For example, it will find a change in a game score to be rewarding only if the score is displayed on the screen and the change is hard to predict. The agent will typically find interactions with new objects rewarding, as the outcomes of such interactions are usually harder to predict than other aspects of the environment. Similar to prior work, we tried to avoid modeling all aspects of the environment, whether they are relevant or not, by choosing to model features of the observation. Surprisingly, we found that even random features worked well. What Do Curious Agents Do? We tested our agent across 50+ different environments and observed a range of competence levels from seemingly random actions to deliberately interacting with the environment. To our surprise, in some environments the agent achieved the game’s objective even though the game’s objective was not communicated to it through an extrinsic reward. Intrinsic reward at the beginning of training; intrinsic reward spike when passing the level for the first time. Breakout – The agent experiences spikes of intrinsic reward when it sees a new configuration of bricks early on in training and when it passes the level for the first time after training for several hours. Pong – We trained the agent to control both paddles at the same time, and it learned to keep the ball in play, resulting in long rallies. Even when trained against the in-game AI, the agent tried to prolong the game rather than win. Bowling – The agent learned to play the game better than agents trained to maximize the (clipped) extrinsic reward directly. We think this is because the agent gets attracted to the difficult-to-predict flashing of the scoreboard that occurs after strikes. Mario – The intrinsic reward is particularly well-aligned with the game’s objective of advancing through the levels. The agent is rewarded for finding new areas because the details of a newly found area are impossible to predict. As a result, the agent discovers 11 levels, finds secret rooms, and even defeats bosses. The Noisy-TV Problem Like a gambler at a slot machine attracted to chance outcomes, the agent sometimes gets trapped by its curiosity as the result of the noisy-TV problem. The agent finds a source of randomness in the environment and keeps observing it, always experiencing a high intrinsic reward for such transitions. Watching a TV playing static noise is an example of such a trap. We demonstrate this literally by placing the agent in a Unity maze environment with a TV playing random channels.
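Before the noisy-TV examples below, here is a rough sketch of the prediction-error bonus just described: a learned next-feature predictor whose error on a transition serves as the intrinsic reward. The encoder, sizes, and training details are assumptions for illustration, not the released code; freezing the encoder corresponds to the random-features variant mentioned above.

    import torch
    import torch.nn as nn

    class PredictionErrorCuriosity(nn.Module):
        # Intrinsic reward = error of a learned next-feature predictor (illustrative sketch).
        def __init__(self, obs_dim, act_dim, feat_dim=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU(),
                                         nn.Linear(feat_dim, feat_dim))
            self.predictor = nn.Sequential(nn.Linear(feat_dim + act_dim, feat_dim), nn.ReLU(),
                                           nn.Linear(feat_dim, feat_dim))

        def intrinsic_reward(self, obs, act, next_obs):
            with torch.no_grad():
                target = self.encoder(next_obs)                 # features of the actual next state
            pred = self.predictor(torch.cat([self.encoder(obs), act], dim=-1))
            return ((pred - target) ** 2).mean(dim=-1)          # high where the next state is hard to predict

        def loss(self, obs, act, next_obs):
            # Training the predictor on experience makes familiar transitions "boring"
            # (low error, low reward), so the agent is pushed toward the unpredictable.
            return self.intrinsic_reward(obs, act, next_obs).mean()

Because this bonus rewards anything hard to predict, a persistent source of randomness keeps paying out forever, which is exactly the noisy-TV trap illustrated next.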
Agent in a maze with a noisy TV; agent in a maze without a noisy TV. While the noisy-TV problem is a concern in theory, for largely deterministic environments like Montezuma’s Revenge, we anticipated that curiosity would drive the agent to discover rooms and interact with objects. We tried several variants of next-state-prediction-based curiosity, combining the exploration bonus with the score from the game. In these experiments, the agent controls the environment through a noisy controller that, with some probability, repeats the last action instead of the current one. This setup with sticky actions was suggested as a best practice for training agents on fully deterministic games like Atari to prevent memorization. Sticky actions make the transition from room to room unpredictable. Random Network Distillation Since next-state prediction is inherently susceptible to the noisy-TV problem, we identified the following relevant sources of prediction errors: Factor 1: Prediction error is high where the predictor fails to generalize from previously seen examples. Novel experience then corresponds to high prediction error. Factor 2: Prediction error is high because the prediction target is stochastic. Factor 3: Prediction error is high because information necessary for the prediction is missing, or the model class of predictors is too limited to fit the complexity of the target function. We determined that Factor 1 is a useful source of error since it quantifies the novelty of experience, whereas Factors 2 and 3 cause the noisy-TV problem. To avoid Factors 2 and 3, we developed RND, a new exploration bonus that is based on predicting the output of a fixed and randomly initialized neural network on the next state, given the next state itself. The intuition is that predictive models have low error in states similar to the ones they have been trained on. In particular, the agent's predictions of the output of a randomly initialized neural network will be less accurate in novel states than in states the agent visited frequently. The advantage of using a synthetic prediction problem is that we can have it be deterministic (bypassing Factor 2) and inside the class of functions the predictor can represent (bypassing Factor 3) by choosing the predictor to be of the same architecture as the target network. These choices make RND immune to the noisy-TV problem. We combine the exploration bonus with the extrinsic rewards through a variant of Proximal Policy Optimization (PPO) that uses two value heads for the two reward streams. This allows us to use different discount rates for the different rewards, and combine episodic and non-episodic returns. With this additional flexibility, our best agent often finds 22 out of the 24 rooms on the first level in Montezuma’s Revenge, and occasionally passes the first level after finding the remaining two rooms. The same method gets state-of-the-art performance on Venture and Gravitar. The visualization of the RND bonus below shows a graph of the intrinsic reward over the course of an episode of Montezuma’s Revenge where the agent finds the torch for the first time. Implementation Matters Big-picture considerations like susceptibility to the noisy-TV problem are important for the choice of a good exploration algorithm. However, we found that getting seemingly small details right in our simple algorithm made the difference between an agent that never leaves the first room and an agent that can pass the first level.
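Before those details, here is a minimal sketch of the RND bonus itself: a fixed, randomly initialized target network, a predictor of the same architecture trained to match it on visited states, and a running normalization that keeps the bonus in a predictable range (one of the stability measures discussed next). Names and sizes are illustrative, not the released implementation.

    import torch
    import torch.nn as nn

    def small_mlp(in_dim, out_dim, hidden=128):
        return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

    class RNDBonus(nn.Module):
        # Random Network Distillation exploration bonus (illustrative sketch).
        def __init__(self, obs_dim, feat_dim=64):
            super().__init__()
            self.target = small_mlp(obs_dim, feat_dim)      # fixed, randomly initialized; never trained
            self.predictor = small_mlp(obs_dim, feat_dim)   # trained to match the target on visited states
            for p in self.target.parameters():
                p.requires_grad_(False)
            self.register_buffer('err_var', torch.ones(())) # running scale for reward normalization

        def bonus(self, next_obs):
            with torch.no_grad():
                err = ((self.predictor(next_obs) - self.target(next_obs)) ** 2).mean(dim=-1)
                self.err_var.mul_(0.99).add_(0.01 * err.var(unbiased=False))
                return err / (self.err_var.sqrt() + 1e-8)   # high in novel states, low in familiar ones

        def loss(self, next_obs):
            # Distillation loss: as the predictor learns the target on states the agent visits,
            # those states stop being rewarded, while genuinely new states keep a high bonus.
            return ((self.predictor(next_obs) - self.target(next_obs)) ** 2).mean()

In the full method this bonus is combined with the game score through the two-value-head PPO variant described above, with the intrinsic stream treated as non-episodic.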
To add stability to the training, we avoided saturation of the features and brought the intrinsic rewards to a predictable range. We also noticed significant improvements in the performance of RND every time we discovered and fixed a bug (our favorite one involved accidentally zeroing an array, which resulted in extrinsic returns being treated as non-episodic; we realized this was the case only after being puzzled by the extrinsic value function looking suspiciously periodic). Getting such details right was a significant part of achieving high performance even with algorithms conceptually similar to prior work. This is one reason to prefer simpler algorithms where possible. Future Directions We suggest the following paths forward for future research: Analyze the benefits of different exploration methods and find novel ways of combining them. Train a curious agent on many different environments without reward and investigate the transfer to target environments with rewards. Investigate global exploration that involves coordinated decisions over long time horizons. If you are interested in working on overcoming these challenges, then apply to work with us! Acknowledgments Thanks to those who contributed to these papers and this blog post: Large-Scale Study of Curiosity-Driven Learning: Yuri Burda*, Harrison Edwards*, Deepak Pathak*, Amos Storkey, Trevor Darrell, Alexei A. Efros Exploration by Random Network Distillation: Yuri Burda*, Harrison Edwards*, Amos Storkey, Oleg Klimov (* equal contributions) Blog post: Karl Cobbe, Alex Nichol, Joshua Achiam, Phillip Isola, Alex Ray, Jonas Schneider, Jack Clark, Greg Brockman, Ilya Sutskever, Ben Barry, Amos Storkey, Alexei Efros, Deepak Pathak, Trevor Darrell, Andrew Brock, Antreas Antoniou, Stanislaw Jastrzebski, Ashley Pilipiszyn, Justin Wang Footnotes There is an anonymous ICLR submission concurrent with our own work which exceeds human performance, though not to the same extent. ↩︎ Read more »
  • Learning Complex Goals with Iterated Amplification
    true
    We’re proposing an AI safety technique called iterated amplification that lets us specify complicated behaviors and goals that are beyond human scale, by demonstrating how to decompose a task into simpler sub-tasks, rather than by providing labeled data or a reward function. Although this idea is in its very early stages and we have only completed experiments on simple toy algorithmic domains, we’ve decided to present it in its preliminary state because we think it could prove to be a scalable approach to AI safety. Read Paper If we want to train an ML system to perform a task, we need a training signal — a way to evaluate how well it is doing in order to help it learn. For example, labels in supervised learning or rewards in reinforcement learning are training signals. The formalism of ML usually assumes a training signal is already present and focuses on learning from it, but in reality the training signal has to come from somewhere. If we don’t have a training signal, we can’t learn the task, and if we have the wrong training signal, we can get unintended and sometimes dangerous behavior. Thus, it would be valuable, both for learning new tasks and for AI safety, to improve our ability to generate training signals. How do we currently generate training signals? Sometimes, the goal we want can be evaluated algorithmically, like counting up the score in a game of Go or checking whether a set of numbers has been successfully sorted (left panels of figure below). Most real-world tasks don’t lend themselves to an algorithmic training signal, but often we can instead obtain a training signal by having a human either perform the task (for example, labeling a training set or demonstrating an RL task), or judge an AI’s performance on the task (middle panels of figure below). However, many tasks are so complicated that a human can’t judge or perform them; examples might be designing a complicated transit system or managing every detail of the security of a large network of computers (right panels of figure below). Iterated amplification is a method for generating a training signal for the latter types of tasks, under certain assumptions. Namely, although a human can’t perform or judge the whole task directly, we assume that a human can, given a piece of the task, identify clear smaller components of which it’s made up. In the networked computer example, a human could break down “defend a collection of servers and routers” into “consider attacks on the servers”, “consider attacks on the routers”, and “consider how the previous two attacks might interact”. Additionally, we assume a human can do very small instances of the task, for example “identify if a specific line in a log file is suspicious”. If these two things hold true, then we can build up a training signal for big tasks from human training signals for small tasks, using the human to coordinate their assembly. In our implementation of amplification, we start by sampling small subtasks and training the AI system to do them by soliciting demonstrations from humans (who can do these small tasks). We then begin sampling slightly larger tasks, solving them by asking humans to break them up into small pieces, which AI systems trained from the previous step can now solve. We use the solutions to these slightly harder tasks, which were obtained with human help, as a training signal to train AI systems to solve these second-level tasks directly (without human help).
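Schematically, the procedure described here (and continued in the next paragraph) looks like the loop below. The helper functions standing in for the human's contributions (human_solve_small, human_decompose, combine) and the learner interface are hypothetical placeholders, not an API from the paper.

    # Schematic sketch of the iterated-amplification training loop described in the text.
    # All helpers are hypothetical placeholders for the roles described above.
    def iterated_amplification(learner, sample_tasks, levels,
                               human_solve_small, human_decompose, combine):
        # Level 0: train directly on small tasks that a human can demonstrate.
        learner.train([(task, human_solve_small(task)) for task in sample_tasks(level=0)])

        for level in range(1, levels):
            examples = []
            for task in sample_tasks(level):
                # The human only decomposes and recombines; the current learner,
                # trained at the previous level, solves the pieces.
                subtasks = human_decompose(task)
                sub_solutions = [learner.solve(sub) for sub in subtasks]
                examples.append((task, combine(task, sub_solutions)))
            # These amplified solutions become the training signal for solving
            # tasks at this level directly, without human help.
            learner.train(examples)
        return learner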
We then continue on to progressively more composite tasks, iteratively building up a training signal as we go. If the process works, the end result is a totally automated system that can solve highly composite tasks despite starting with no direct training signal for those tasks. This process is somewhat similar to expert iteration (the method used in AlphaGo Zero), except that expert iteration reinforces an existing training signal, while iterated amplification builds up a training signal from scratch. It also has features in common with several recent learning algorithms that use problem decomposition on the fly to solve a problem at test time, but differs in that it operates in settings where there is no prior training signal. Experiments As with our previous work on AI safety via debate, working directly on tasks beyond human scale is too hard for a prototype project. Also, using an actual human as the training signal introduces complications, so we haven’t done this yet (though we plan to do it in the future). For our first experiments, we instead try to amplify an algorithmic training signal, to show that iterated amplification can work in this simple setting. We also limit our attention to supervised learning (unlike our previous work on human training signals in RL). We’ve tried the method on five toy algorithmic tasks. These tasks have direct algorithmic solutions, which we pretend we don’t know (for example, finding the shortest path between two points on a graph). The problems can also be solved by piecing together small one-step deductions (for example, combining two paths to form a longer path), but it would take an exponential amount of effort to piece everything together manually. We use iterated amplification to learn the direct algorithm using only the pieces as a training signal, thus simulating the situation where a human knows how to combine subpieces of a solution but can’t provide a direct training signal. On each of these five tasks (permutation powering, sequential assignments, wildcard search, shortest path, and union find), we are able to perform competitively with just directly learning the task via supervised learning, despite being handicapped by not having a direct training signal (the goal here is to match supervised learning with less information, not to surpass it). Iterated amplification has comparable performance to supervised learning without ever seeing the ground-truth labels. Amplification has features in common with our previous work on AI safety via debate. Like debate, it seeks to train systems to perform tasks that are beyond human ability to directly perform or judge, through an iterative process that allows humans to provide indirect supervision; however, the specific approach is different. It also builds on our work on human feedback, by implementing a reward prediction system, and later versions of it are likely to include feedback from actual humans. So far we’ve explored all of these methods in a preliminary way, and are now working on scaling them up to address more interesting and realistic problems. If you’re interested in helping us extend tools like iterated amplification to build safe, powerful AI, then consider joining OpenAI. Acknowledgments Thanks to the following for feedback on drafts of this post: Catherine Olsson, Geoffrey Irving, Jack Clark, Jacob Steinhardt, Jeff Wu, Michael Page, Miles Brundage Read more »
