
Language and vision intertwine in the human mind, shaping how we perceive and understand the world around us. Our ability to reason is deeply rooted in both linguistic thought and visual memory - but what happens when we extend these capabilities to AI? Today’s large language models have demonstrated remarkable reasoning abilities, but we wondered: could they harness the power of visual understanding to reach new heights of cognitive capability?

Imagine an AI that can look at a complex physics problem, and methodically reason its way to a solution with the confidence of a master physicist. This vision inspired us to create QVQ - an open-weight model for multimodal reasoning, built upon Qwen2-VL-72B. QVQ represents a significant leap forward in AI’s capacity for visual understanding and complex problem-solving. QVQ achieves a score of 70.3 on MMMU and shows substantial improvements across math-related benchmarks compared to Qwen2-VL-72B-Instruct. Through careful step-by-step reasoning, QVQ demonstrates enhanced capabilities in visual reasoning tasks, particularly excelling in domains that demand sophisticated analytical thinking.

Limitations

QVQ-72B-Preview is an experimental research model developed by the Qwen team, focusing on enhancing visual reasoning capabilities. While its performance has exceeded expectations, there are several limitations to be aware of:

  1. Language Mixing and Code-Switching: The model may mix languages or switch between them unexpectedly, affecting response clarity.
  2. Recursive Reasoning: The model may get stuck in circular logic patterns, producing verbose responses without reaching conclusions.
  3. Safety and Ethical Considerations: The model requires enhanced safety measures to ensure reliable and secure performance, and users should be cautious when deploying it.
  4. Performance and Benchmark Limitations: Although the model has shown improvements in visual reasoning, it cannot fully replace the capabilities of Qwen2-VL-72B-Instruct. Additionally, during multi-step visual reasoning, the model may gradually lose focus on the image content, leading to hallucinations.

Performance

We evaluate QVQ-72B-Preview on four datasets:

  • MMMU: A university-level multidisciplinary multimodal evaluation dataset designed to assess models’ comprehensive visual understanding and reasoning capabilities.
  • MathVista: A mathematics-focused visual reasoning test set that evaluates capabilities such as logical reasoning with puzzle test graphics, algebraic reasoning with function graphs, and scientific reasoning with academic paper figures.
  • MathVision: A high-quality multimodal mathematical reasoning test set derived from real mathematics competitions, featuring greater problem diversity and subject breadth compared to MathVista.
  • OlympiadBench: An Olympiad-level bilingual multimodal science benchmark containing 8,476 problems from Olympiad-level mathematics and physics competitions, including the Chinese college entrance examination. Each problem comes with expert-level annotations detailing the step-by-step reasoning process.

In particular, QVQ-72B-Preview achieves a score of 70.3 on the MMMU benchmark, significantly outpacing its predecessor, Qwen2-VL-72B-Instruct. On the remaining three benchmarks, which focus on mathematics and science problems, the model also performs strongly, effectively closing the gap with the leading state-of-the-art o1 model.

Demo Cases

In the following section, we present several examples to illustrate the application of this new model in visual reasoning tasks.

Example: Math
User
Given $h(x) = f(x) \cdot g(x)$, find $h'(2)$ using the table below: [image]
QVQ-72B-Preview

Let’s tackle this problem step by step. So, we have two functions, f(x) and g(x), and their derivatives f’(x) and g’(x), tabulated for specific values of x. We’re asked to find h’(2), where h(x) is the product of f(x) and g(x). That means h(x) = f(x) * g(x).

First, I need to recall the product rule for derivatives. The product rule tells us how to find the derivative of the product of two functions. It states that if you have two differentiable functions u(x) and v(x), then the derivative of their product is given by:

(u*v)’(x) = u’(x)*v(x) + u(x)*v’(x)

So, in this case, u(x) is f(x) and v(x) is g(x). Therefore, h’(x) = f’(x)*g(x) + f(x)*g’(x).

Now, I need to find h’(2), which means I need to evaluate this derivative at x = 2.

Looking at the table:

  • At x = 2, f(2) = -3

  • f’(2) = -8

  • g(2) = 1

  • g’(2) = 7

Plugging these values into the product rule formula:

h’(2) = f’(2)*g(2) + f(2)*g’(2)

= (-8)(1) + (-3)(7)

= -8 + (-21)

= -29

Wait a minute, let me double-check that. Maybe I should consider if there’s any other way to approach this problem.

Another approach could be to consider the properties of derivatives and see if there’s a different way to compute h’(2). But I think the product rule is the most straightforward method here.

Let me verify the values from the table again:

  • f(2) = -3

  • f’(2) = -8

  • g(2) = 1

  • g’(2) = 7

These seem correct as per the table.

Alternatively, perhaps I should consider if the functions f(x) and g(x) have any special properties or if there’s a pattern I’m missing. For example, is g(x) a constant function? Looking at the table:

  • g(-2) = -11

  • g(-1) = -2

  • g(0) = -1

  • g(1) = -2

  • g(2) = 1

It doesn’t seem like g(x) is constant, as the values change. Similarly, f(x) is changing as well.

Maybe I should think about the units if this were a real-world problem, but since it’s an abstract mathematical problem, that might not help.

Alternatively, perhaps I should consider the geometric interpretation of the derivative. The derivative h’(2) represents the instantaneous rate of change of h(x) at x = 2. But again, that might not help in calculating the numerical value.

Let me try plugging in the values again to make sure I didn’t make a miscalculation.

h’(2) = f’(2)*g(2) + f(2)*g’(2)

= (-8)(1) + (-3)(7)

= -8 + (-21)

= -29

No, that seems correct.

Wait, maybe I should consider if the table has any units or if there’s a scaling factor I’m missing, but the problem doesn’t indicate any units, so I assume the values are just numerical.

Alternatively, perhaps there’s a typo in the problem or the table, but based on the given information, the calculation seems straightforward.

Let me consider if there’s a way to cross-verify this result. Suppose I were to consider the function h(x) = f(x)*g(x), and I want to find h’(2).

Another way to think about it is to consider the definition of the derivative:

h’(2) = lim[h -> 0] [h(2 + h) - h(2)] / h

But without knowing the explicit forms of f(x) and g(x), this approach isn’t practical because I can’t compute h(2 + h) for arbitrary h.

Therefore, relying on the product rule seems like the most appropriate method here.

So, my final answer is h’(2) = -29.

Final Answer

$$ \boxed{-29} $$
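
The arithmetic in the response above can be checked independently. The following is a minimal Python sketch, not part of the model's output: the helper name `product_rule` is our own illustration, and the inputs are the values at x = 2 that the model reads from the table.

```python
# Values at x = 2, as quoted from the table in the model's response.
f_2, f_prime_2 = -3, -8
g_2, g_prime_2 = 1, 7


def product_rule(f, f_prime, g, g_prime):
    """Derivative of f*g at a point, given both values and both derivatives there."""
    return f_prime * g + f * g_prime


h_prime_2 = product_rule(f_2, f_prime_2, g_2, g_prime_2)
print(h_prime_2)  # -29
```

The result matches the boxed answer, so the model's repeated re-checks did not change a calculation that was already correct after the first pass.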

Next Step

As we progress towards achieving AGI, our vision is to develop an omni and smart model. To realize this goal, we are enhancing our vision-language foundation model with advanced capabilities for deep thinking and reasoning based on visual information. In the near future, we plan to integrate additional modalities into a unified model, making it even more intelligent and capable of addressing complex challenges and engaging in scientific exploration.