Visual data and text data are composed of information at multiple
granularities. A video can describe a complex scene that is composed of