8 comments

  • visioninmyblood 2 hours ago
    I was using this for video understanding with inference from vlm.run infra. It has definitely outperformed Gemini, which is generally much better than OpenAI or Claude on videos. The detailed extraction is pretty good. With agents you can also crop into a segment and do more operations on it. Have to see how the multimodal space progresses:

    link to results: https://chat.vlm.run/c/82a33ebb-65f9-40f3-9691-bc674ef28b52

    Quick demo: https://www.youtube.com/watch?v=78ErDBuqBEo

    • colechristensen 19 minutes ago
      I found it pretty funny how bad Claude was at cropping an image. It was a cute little character with some text off to the side on a white background, all very clean cartoon vibes, and it COULD NOT just select the character. I pursued it for 20 minutes because I thought it was funny. Of course it took 45 seconds to do it myself.

      A lot of my side projects involve UIs, and almost all of my problems with getting LLMs to write them for me involve "The UI isn't doing what you say it's doing" and struggling with A) finding a reliable way to get it to look at the UI so it can continue its loop and B) getting it to understand what it's looking at well enough to do something about it.

  • eurekin 1 hour ago
    Insane if true... now I wonder if I could use it to go through some old dance routine video catalogue to recognize and write up individual move lists
  • djmips 2 hours ago
    Does anyone else worry about this technology used for Big Brother type surveillance?
    • reactordev 2 hours ago
      Where have you been the last decade? It’s already in use, or models like it, by companies selling access to The State

      https://deflock.me

      Not to mention cloud platforms that collect evidence and process it with all the models and store that information for searching…

      https://www.revir.ai

      • eurekin 1 hour ago
        No mention of Palantir?
        • bilbo0s 1 hour ago
          >It’s already in use, or models like it, by companies selling access to The State

          Doesn't that pretty much cover Palantir as well?

      • mptest 1 hour ago
        or if you prefer your depression in book format: Surveillance Capitalism by Zuboff; Pegasus: A Spy in Your Pocket by Laurent Richard
    • basilgohar 43 minutes ago
      How do you think this tech was developed in the first place? It's probably trained and used in the surveillance bid for a decade before it comes to consumers, and this probably isn't the SotA stuff that governments have access to. We're probably 5-10 years behind what's on the cutting edge.
    • protocolture 40 minutes ago
      We got Facial Rec and LPR first, those are more dangerous for surveillance.
    • g-mork 1 hour ago
      warmly encourage you to avoid reading the header files of the Dahua camera SDK
  • thot_experiment 2 days ago
    anyone have a tl;dr for me on what the best way to get the video comprehension stuff going is? i use qwen-30b-vl all the time locally as my go-to model because it's just so insanely fast. curious to mess with the video stuff; the vision comprehension works great and i use it for OCR and classification all the time
    • xrd 2 hours ago
      How much VRAM do you need for local usage, may I ask?
  • moralestapia 2 hours ago
    To me, this qualifies as some sort of ASI already.
  • spwa4 2 hours ago
    It's so weird how that works with transformers.

    Finetuning an LLM "backbone" (if I understand correctly: a fully trained but not instruction tuned LLM, usually small because students) with OCR tokens bests just about every OCR network out there.

    And it's not just OCR. Describing images. Bounding boxes. Audio, both ASR and TTS, all works better that way. Now many research papers are only really about how to encode image/audio/video to feed it into a Llama or Qwen model.

    • zmmmmm 1 hour ago
      It is fascinating. Vision language models are unreasonably good compared to dedicated OCR and even the language tasks to some extent.

      My take is it fits into the general concept that generalist models have significant advantages because so much more latent structure maps across domains than we expect. People still talk about fine tuning dedicated models being effective but my personal experience is it's still always better to use a larger generalist model than a smaller fine tuned one.

      • kgeist 15 minutes ago
        >People still talk about fine tuning dedicated models being effective

        >it's still always better to use a larger generalist model than a smaller fine tuned one

        Smaller fine-tuned models are still a good fit if they need to run on-premises cheaply and are already good enough. Isn't it their main use case?

        • bangaladore 10 minutes ago
          Latency and size. Otherwise pretty much useless.
      • jepj57 19 minutes ago
        Now apply that thinking to human-based neural nets...