
Physion

Physion: A dataset and benchmark for rigorously evaluating the ability to predict how physical scenarios will evolve. The dataset features realistic simulations of a wide range of physical phenomena, including rigid- and soft-body collisions, stable multi-object configurations, rolling, sliding, and projectile motion, providing a more comprehensive challenge than previous benchmarks.

FitVid: A simple and scalable variational video prediction model that attains a significantly better fit to current video prediction datasets with a parameter count similar to prior models. The authors also propose a set of data augmentation techniques for video prediction that prevent overfitting, leading to state-of-the-art results across a range of prediction benchmarks.

Non-causal shortcut learning: when a model learns to perform a task by exploiting incidental cues rather than the underlying structure, i.e., without using all of the information that would normally be necessary to complete that task. By analogy, a human can learn to ride a bicycle without being told the physics of how bicycles work.

Adding noise, aka adding randomness: the model should not be too precise, because humans cannot predict rare flukes. For example, a person predicting where a thrown ball will land cannot pinpoint the exact landing spot, given the many variables involved (such as air resistance); instead, they predict a general area where it will land.

Future Research:

  • The dynamics of jointed multi-part objects: how objects move and interact with each other when joints connect them. For example, for two pieces of cardboard connected by a hinge, the dynamics of the jointed object describe how the two pieces move in relation to each other when the hinge is moved.

  • Adding noise to the models' forward dynamics, which might mimic how humans make predictions about probable outcomes, rather than simulating dynamics so precisely that they capture even rare flukes (see the sketch below).

  • Building upon FitVid and introducing new training-aware metrics for video prediction and generation, to signal when a model is generating high-quality videos merely by repeating the training data.

  • Creating more complex datasets that mix different physical-interaction events in the same frame.
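A minimal sketch of the noise-injection idea, assuming a generic learned one-step dynamics model `step_fn` (all names and values here are hypothetical illustrations, not from the papers):

```python
import numpy as np

def noisy_rollout(step_fn, state, horizon, n_samples=50, sigma=0.05, seed=None):
    """Roll a learned forward-dynamics model ahead while injecting Gaussian
    noise at each step, producing a distribution over plausible futures
    instead of one overly precise trajectory."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        s = np.asarray(state, dtype=float)
        traj = [s]
        for _ in range(horizon):
            # Perturb each one-step prediction; sigma controls how "fuzzy"
            # the model's notion of the future is.
            s = step_fn(s) + rng.normal(0.0, sigma, size=s.shape)
            traj.append(s)
        samples.append(np.stack(traj))
    return np.stack(samples)  # shape: (n_samples, horizon + 1, state_dim)

# Toy usage: a ball-like state [position, velocity] under constant deceleration.
step = lambda s: np.array([s[0] + s[1], s[1] - 0.1])
futures = noisy_rollout(step, state=[0.0, 1.0], horizon=20)
landing_spread = futures[:, -1, 0]
print(landing_spread.mean(), landing_spread.std())
```

Taking the mean or quantiles over the sampled endpoints then yields the kind of "general area" prediction described above, rather than a single exact landing spot.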

Overall Importance: Video prediction, or the representations learned by video prediction, can be a major step toward fully autonomous self-learning agents. The Physion benchmark is an important step toward actually measuring whether a given algorithm perceives visual scenes and makes physical predictions the way people do. If that turns out to be critical for achieving safe, high performance in some real-world domains, the benchmark (or its successors) could be used to screen for algorithms more likely to behave like people and to diagnose failures, e.g., by breaking them down into problems with predicting particular physical phenomena.

Physion 2.0 Motivation:

  • Design additional nonrigid and mass-inference scenarios for inclusion, e.g., heaviness with balance, dunking (fluidity), deformity, pendulum.

  • Introduce larger amounts of variation in the shape and spatial distribution of all objects.

Objects differ in both shape and physical properties. The Physion study focuses on dynamics modeling for objects with different shapes, and less on the latter. Different objects have different dynamics depending on their shape; for example, a sphere will roll down a hill, while a cube will slide. By modeling the dynamics of differently shaped objects, we can better understand how they will move in different situations. Objects also have different physical properties depending on their composition; for example, a metal object is heavier and more resistant to damage than a glass one. By understanding these properties, we can better predict how objects will interact with their surroundings.

How do people pick up the dynamics of objects with different physical properties? People seem to rely on both static and dynamic visual cues, including the texture of the objects (e.g., metal looks heavy and wood looks light) and the motion the objects generate. Here we want to focus on the dynamics of visual cues: the change in an object's appearance over time, such as when it is in motion, or the way its appearance changes when it interacts with other objects. For example, a fast-moving object may appear to blur, because the human visual system cannot process all of the incoming information at once and relies on cues like motion to understand what is happening. Likewise, when two objects collide they may deform or change color; the visual system can extract information about the forces acting on the objects, which helps us understand the physics of the situation. People seem able to learn quickly from a short video clip capturing an object's motion, and they can also estimate the rough (relative) magnitude of its physical properties.

Open questions:

  • In what circumstances can people successfully learn the dynamics, and in what circumstances will they fail?

  • How accurately can people predict the magnitude of the physical properties?

  • How does the prediction change over time as more information is revealed to the observer? (See the sketch below.)
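One simple way to make the last question concrete, as a minimal sketch (the Gaussian cue model, `update_mass_belief`, and all numbers are assumptions for illustration, not part of the study): treat the observer as Bayesian and update a belief over an object's relative mass after each noisy motion cue.

```python
import numpy as np

def update_mass_belief(prior_mean, prior_var, observation, obs_var):
    """One conjugate Gaussian update of a belief over relative mass, given a
    noisy cue (e.g., an observed post-collision velocity mapped to mass)."""
    k = prior_var / (prior_var + obs_var)            # Kalman-style gain
    post_mean = prior_mean + k * (observation - prior_mean)
    post_var = (1.0 - k) * prior_var
    return post_mean, post_var

# The belief sharpens as more frames/cues are revealed to the observer:
mean, var = 1.0, 1.0            # vague prior over relative mass
for cue in [1.8, 2.1, 1.9]:     # hypothetical noisy cues from successive events
    mean, var = update_mass_belief(mean, var, cue, obs_var=0.5)
    print(f"mean={mean:.2f}, var={var:.2f}")
```

Under this toy model, the posterior mean drifts toward the cues while the variance shrinks, which is one candidate account of predictions improving as more of the video is seen.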
9/23 Meeting:

  • OCP = object contact prediction task.

  • Clarify the CAB framework.

  • Blocked [varying scenes between groups] vs. interleaved [within groups]: "between" means comparing different conditions across separate groups, e.g., one group watches a video simulation of dominoes falling while the other watches a cup dropping onto a bouncy surface, and the groups' results are then compared. "Within" means comparing different conditions within the same group.

  • Thoughts: do a test run to see if performance drops. Are they using an incentive check that might mitigate a drop? No; for exclusions they check whether people are just clicking through the answers, and the score is motivating enough. Bonuses are considered a hassle because Prolific doesn't make them easy, and you can't expect people to do "better".

  • Presumption: the videos are played, and people respond yes/no after watching each one (i.e., they can't skip the videos).

  • Are these scenarios "common-sense"? These are slightly related things to analyze, given what we know about people's ability to predict.

  • How do they set a threshold for how many scenarios are sufficient for generalizing human prediction? Pick a number that seems like enough, e.g., 150 scenarios (75 positive, 75 negative where the object doesn't touch the yellow zone), and 100 participants, which seemed high enough that it doesn't really matter.

9/26 Meeting:

  • Run people until you reach 100 participants after exclusions.

  • From the data, you can determine how many times to replicate until you hit a significant result.

  • Walk through Prolific and how to upload the study.

  • CAB framework: Cognitive AI Benchmarking, a repository of experiment code and documentation; a cleaned-up version of the Physion experiment code.

9/30 Meeting:

  • Run through the experiment flow: Animation -> Binary response -> Qs -> fixation cross -> repeat.

  • Pilot study

  • Aphantasia: the inability to form mental images of objects that are not present.

  • Share the pilot task link after completing it [share with the undergrad scrum lab to obtain more in-house data]. Remember to tell them to time themselves [ask about deadline: 48 hrs].

  • Innovation -> an analysis involving the line plot of low vs. high physical-property values (see the sketch below); how human material perception works; benchmark paper vs. new-model paper.

10/7 Meeting: Go through the stimuli and fix issues, e.g., depth perception, so users can view the animations from a top-down perspective.
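A minimal sketch of the "low vs. high" line-plot analysis mentioned in the 9/30 notes; every number below is a hypothetical placeholder, since the real values would come from the pilot data.

```python
import matplotlib.pyplot as plt

# Placeholder accuracies per trial block, split by the physical-property
# setting; to be replaced with the actual pilot results.
blocks = [1, 2, 3, 4]
acc_low = [0.55, 0.60, 0.66, 0.70]   # low physical-property condition
acc_high = [0.52, 0.63, 0.71, 0.76]  # high physical-property condition

plt.plot(blocks, acc_low, marker="o", label="low property value")
plt.plot(blocks, acc_high, marker="s", label="high property value")
plt.xlabel("Trial block")
plt.ylabel("Prediction accuracy")
plt.title("Human accuracy: low vs. high physical-property conditions")
plt.legend()
plt.show()
```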
