• CocksuckerDynamo@alien.topB
    10 months ago

    It would be better if they provided single-batch information for normal inference on fp8.

    better for who? people that are just curious or people that are actually going to consider buying H200s?

    who is buying a GPU that costs more than a new car and using it for single batch?

    • Aaaaaaaaaeeeee@alien.topB
      10 months ago

      It's useful for people who want to know the inference response time.

      This wouldn’t give us a 4000 ctx reply in 1/3 of a second.

      • CocksuckerDynamo@alien.topB
        10 months ago

        It's useful for people who want to know the inference response time.

        No, it’s useful for people who want to know the inference response time with batch size 1, which is not something that prospective H200 buyers care about. Are you aware that deployments in business environments for interactive use cases such as real-time chat generally use batching? Perhaps you’re assuming request batching is just for offline / non-interactive use, but that isn’t the case.