Qwen3-VL can scan two-hour videos and pinpoint nearly every detail

(the-decoder.com)

80 points | by thm 2 days ago

8 comments

visioninmyblood 2 hours ago
I was using this for video understanding with inference from vlm.run infra. It has definitely outperformed Gemini, which is generally much better than OpenAI or Claude on videos. The detailed extraction is pretty good. With agents you can also crop into a segment and do more operations on it. We'll have to see how the multimodal space progresses.
Link to results: https://chat.vlm.run/c/82a33ebb-65f9-40f3-9691-bc674ef28b52
Quick demo: https://www.youtube.com/watch?v=78ErDBuqBEo
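For anyone wondering what "crop into a segment" can look like in practice, here is a minimal sketch of picking evenly spaced frames from a time window before handing them to a vision model. The function name `sample_frame_indices` and its parameters are hypothetical, made up for illustration; this is not how vlm.run or Qwen3-VL does it, just one common way to do it.

```python
def sample_frame_indices(fps: float, start_s: float, end_s: float, num_frames: int) -> list[int]:
    """Return evenly spaced frame indices inside the segment [start_s, end_s).

    Hypothetical helper: the name and parameters are assumptions for illustration.
    """
    if fps <= 0 or num_frames <= 0 or end_s <= start_s:
        raise ValueError("need fps > 0, num_frames > 0 and end_s > start_s")

    first = int(start_s * fps)                 # first frame of the segment
    last = max(int(end_s * fps) - 1, first)    # last frame still inside the segment
    total = last - first + 1                   # number of frames available

    n = min(num_frames, total)                 # cannot sample more frames than exist
    if n == 1:
        return [first + total // 2]            # single frame: take the middle one

    step = (total - 1) / (n - 1)               # spacing so both endpoints are included
    return [first + round(i * step) for i in range(n)]


# Example: 5 frames from the window 60s..70s of a 30 fps video.
indices = sample_frame_indices(fps=30.0, start_s=60.0, end_s=70.0, num_frames=5)
```

The returned indices can then be passed to whatever decoder you use to pull those frames out, so the model only sees the segment you care about.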
- colechristensen 19 minutes ago
  I found it pretty funny how bad Claude was at cropping an image. It was a cute little character with some text off to the side on a white background, all very clean cartoon vibes and it COULD NOT just select the character. I pursued it for 20 minutes because I thought it was funny. Of course it was 45 seconds to do it myself.
  A lot of my side projects involve UIs and almost all of my problems with getting LLMs to write them for me involve "The UI isn't doing what you say it's doing" and struggling to get A) a reliable way to get it to look at the UI so it can continue its loop and B) getting it to understand what it's looking at well enough to do something about it
eurekin 1 hour ago
Insane if true... now I wonder if I could use it to go through some old dance routine video catalogue to recognize and write individual move lists
djmips 2 hours ago
Does anyone else worry about this technology being used for Big Brother type surveillance?
- reactordev 2 hours ago
  Where have you been the last decade? It’s already in use, or models like it, by companies selling access to The State
  https://deflock.me
  Not to mention cloud platforms that collect evidence and process it with all the models and store that information for searching…
  https://www.revir.ai
  - eurekin 1 hour ago
    No mention of Palantir?
    - bilbo0s 1 hour ago
      >It’s already in use, or models like it, by companies selling access to The State
      Doesn't that pretty much cover Palantir as well?
  - mptest 1 hour ago
    or if you prefer your depression in book format:
    Surveillance Capitalism by Zuboff
    Pegasus: A Spy in Your Pocket by Laurent Richard
- basilgohar 43 minutes ago
  How do you think this tech was developed in the first place? It's probably trained and used in the surveillance biz for a decade before it comes to consumers, and this probably isn't the SoA stuff that governments have access to. We're probably 5-10 years behind what's on the cutting edge.
- protocolture 40 minutes ago
  We got Facial Rec and LPR first, those are more dangerous for surveillance.
- g-mork 1 hour ago
  warmly encourage you to avoid reading the header files of the Dahua camera SDK
thot_experiment 2 days ago
anyone have a tl;dr for me on what the best way to get the video comprehension stuff going is? i use qwen-30b-vl all the time locally as my go-to model because it's just so insanely fast. curious to mess with the video stuff; the vision comprehension works great and i use it for OCR and classification all the time
- xrd 2 hours ago
  How much VRAM do you need for local usage may I ask?
moralestapia 2 hours ago
To me, this qualifies as some sort of ASI already.
spwa4 2 hours ago
It's so weird how that works with transformers.
Finetuning an LLM "backbone" (if I understand correctly: a fully trained but not instruction-tuned LLM, usually small because students) with OCR tokens bests just about every OCR network out there.
And it's not just OCR. Describing images. Bounding boxes. Audio, both ASR and TTS. It all works better that way. Now many research papers are only really about how to encode image/audio/video to feed it into a Llama or Qwen model.
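To make the "encode the image and feed it to the LLM" part concrete, here is a minimal sketch of the usual first step: cutting an image into fixed-size patches and flattening each one into a vector, so the image becomes a sequence like text tokens. This is a generic ViT-style illustration under my own assumptions, not the actual Qwen or Llama pipeline; `patchify` is a hypothetical name, and in a real model each flattened patch would typically go through a learned projection into the LLM's embedding space.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors.

    Hypothetical helper for illustration. Patches are emitted in row-major
    order (left to right, then top to bottom), one vector per patch.
    """
    height = len(image)
    width = len(image[0])
    if height % patch or width % patch:
        raise ValueError("image height and width must be multiples of the patch size")

    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            # Flatten the patch row by row, keeping all channels of each pixel together.
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])
            tokens.append(flat)
    return tokens


# Example: a 4x4 RGB image whose pixels record their own (row, column) position.
image = [[[y, x, 0] for x in range(4)] for y in range(4)]
tokens = patchify(image, patch=2)  # 4 patches, each a vector of 2 * 2 * 3 = 12 values
```

Video works the same way in spirit: sample frames, patchify each one, and you get a longer sequence for the language model to attend over.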
- zmmmmm 1 hour ago
  It is fascinating. Vision language models are unreasonably good compared to dedicated OCR, and even at the language tasks to some extent.
  My take is it fits into the general concept that generalist models have significant advantages because so much more latent structure maps across domains than we expect. People still talk about fine-tuning dedicated models being effective, but my personal experience is it's still always better to use a larger generalist model than a smaller fine-tuned one.
  - kgeist 15 minutes ago
    >People still talk about fine tuning dedicated models being effective
    >it's still always better to use a larger generalist model than a smaller fine tuned one
    Smaller fine-tuned models are still a good fit if they need to run on-premises cheaply and are already good enough. Isn't that their main use case?
    - bangaladore 10 minutes ago
      Latency and size. Otherwise pretty much useless.
  - jepj57 19 minutes ago
    Now apply that thinking to human-based neural nets...