• 0 Posts
  • 25 Comments
Joined 1 year ago
Cake day: October 30th, 2023

  • You should understand that Python was a leader in data manipulation, statistics, scientific workloads, and Unix pipeline/glue spaces (having largely supplanted Perl, Awk, and R) before becoming a leader in AI. AI was just a natural extension, because Python had all the right stuff for manipulating data and running numbers, and manipulating data is really the bigger part of AI, aside from developing the NNA (neural network architecture) itself (but that is a specialised job for a handful of people, and not constantly reworked in the same way as training data). Python is not really slower for this kind of work, because the underlying libraries for the NNAs are accelerated, and the data manipulation part is usually I/O bound anyway. In short, Python is the right tool for the job, rather than the wrong one, if you understand the actual problems that AI researchers face.
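    To make the "accelerated underlying libraries" point concrete, here is a minimal sketch using NumPy as a stand-in for the compiled kernels that PyTorch and friends also rely on: the same reduction written as an interpreted Python loop and as a single call into compiled code.

    ```python
    # Minimal sketch: the "Python is slow" worry mostly evaporates once the
    # heavy numeric work is delegated to a compiled library such as NumPy.
    import numpy as np

    n = 1_000_000
    data = [float(i) for i in range(n)]

    # Pure-Python loop: every addition goes through the interpreter.
    total_py = 0.0
    for x in data:
        total_py += x

    # NumPy: the same reduction is one call into compiled C code,
    # typically orders of magnitude faster on large arrays.
    total_np = float(np.asarray(data).sum())

    # Same answer either way (exact here, since every partial sum is an
    # integer below 2**53 and so exactly representable as a float64).
    assert total_py == total_np
    ```

    The loop is the pathological case; real workloads push whole arrays through vectorised calls like this, so the interpreter overhead is a rounding error next to the I/O.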

    Inference is just running the models to do useful work, rather than training them. Rust can be used for that too. I do plan to use Rust for this as well, but not in abandonment of Python: in a different use case, where I want to be able to build executables that just work. Since Python is interpreted, it's harder to ship a binary that will work on any system. That matters for AI-based end-user, mass-market applications far more than for training or research inference. Rust can deploy almost anywhere, from servers to Android to the client side of web browsers. That said, I'm concerned about the libraries Rust has available for AI and the other things my app will need, even though candle looks great so far.

    Data prep is more like cleaning the input/training data before training on it.

    The vector part that you’re starting to get a sense of is not a data prep thing; it’s much closer to how transformers work. They transform vectors in a hyperspace. So you throw all of the words into the space, and the AI learns the right vectors to represent the knowledge about how all those words relate.
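    As a toy illustration of what "learning the right vectors" buys you: once words are points in a space, relatedness becomes plain geometry. The three numbers per word below are invented for the example; real models learn hundreds or thousands of dimensions from data.

    ```python
    # Toy illustration: words as vectors, relatedness as geometry.
    # These 3-d vectors are made up for the example; real embeddings
    # are learned, not hand-written.
    import numpy as np

    def cosine(a, b):
        # Cosine similarity: 1.0 means pointing the same way, 0.0 unrelated.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    vecs = {
        "cat": np.array([0.9, 0.1, 0.0]),
        "dog": np.array([0.8, 0.2, 0.1]),
        "car": np.array([0.0, 0.1, 0.9]),
    }

    cat_dog = cosine(vecs["cat"], vecs["dog"])  # high: related concepts
    cat_car = cosine(vecs["cat"], vecs["car"])  # low: unrelated
    assert cat_dog > cat_car
    ```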

    A vector database is different: my understanding is that you basically load data, break it into chunks, project each chunk into a hyperspace (maybe the SAME shape of hyperspace by necessity, not sure), and store that (vector, data) key-value information. At query time the relevant chunks are pulled back out and fed into your LLM's context, like giving the AI an index card for reference: it's the librarian, and when you ask it a question it might just know, or it can look to its card index and dig out the information.
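    A minimal sketch of that chunk-embed-retrieve loop, using a toy bag-of-words embedding as a stand-in for a real embedding model (a real vector database does the same thing at scale, with approximate nearest-neighbour search):

    ```python
    # Toy retrieval sketch: chunk -> vector -> nearest-neighbour lookup.
    # embed() is a deliberately simple stand-in for a learned embedding model.
    import numpy as np

    docs = [
        "Rust compiles to a single native binary",
        "Python excels at data manipulation and glue code",
        "Transformers operate on vectors in a high dimensional space",
    ]

    # A shared vocabulary defines the axes of the (toy) embedding space.
    vocab = sorted({w for d in docs for w in d.lower().split()})

    def embed(text):
        words = text.lower().split()
        v = np.array([float(words.count(w)) for w in vocab])
        n = np.linalg.norm(v)
        return v / n if n else v

    # The "vector database": a list of (vector, data) pairs.
    index = [(embed(d), d) for d in docs]

    def retrieve(query):
        q = embed(query)
        # Vectors are unit-length, so the dot product is cosine similarity.
        return max(index, key=lambda pair: float(np.dot(pair[0], q)))[1]

    best = retrieve("which language ships a native binary")
    ```

    Here `best` comes back as the Rust chunk, which is the "index card" you would then paste into the LLM's context alongside the question.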



  • Different trade-offs. Go is not python, and Rust is not Python, nor Go.

    If you want raw CPU performance or very solid, reliable, production code that's maintainable and known-good, AND/OR you want code that is native, systems-level, and can be deployed on many devices and operating systems or even without an operating system, then some of the Rust-based libraries might be the way to go.

    If you’re purely obsessed with CPU performance, assembly is the way to go, but using assembly optimally for machine learning on a modern CPU is a whole heap of study and work in its own right.

    Arguably, but very importantly, any high-performance code you spend months obsessing over could be obsolete by the time you're done writing it.

    If you want easy, rapid development where you can focus on what the code DOES at a high level, with very cool meta-programming rather than being down in the weeds of how to move bytes around or who owns what piece of memory, python makes a lot more sense.

    Honestly, I don’t see much practical reason to go with a language like Go, though. It’s a halfway house that is neither one thing nor the other.
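    One small illustration of the "very cool meta-programming" point above: in Python, a decorator can change what a function does without touching its body. Here the standard library's `functools.lru_cache` turns a naive exponential-time recursion into a fast memoised one.

    ```python
    # Metaprogramming in one line: the decorator wraps fib and adds
    # memoisation, so the naive recursion becomes effectively linear.
    import functools

    @functools.lru_cache(maxsize=None)
    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    result = fib(100)  # returns instantly thanks to the cache
    ```

    Without the decorator, `fib(100)` would take longer than the age of the universe; with it, the focus stays on what the code DOES rather than how the caching is plumbed.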







  • No, we’re not. Not really.

    You could call this “open source”, yes, but by a very narrow and worthless definition, one that has always been controversial and open to abuse. What people MEAN when they say open source is “like Linux”. Linux is based on, and follows the principles of, Free Software:

    0) The freedom to run the program as you wish, for any purpose.
    1) The freedom to study how the program works, and change it so it does your computing as you wish. Access to the source code is a precondition for this.
    2) The freedom to redistribute copies so you can help others.
    3) The freedom to distribute copies of your modified versions to others.
    -- gnu.org/philosophy
    

    When an LLM model’s weights are free, but it’s censored, you have half of freedom 0.

    When an LLM model gives you the weights, but doesn’t give you the code or the data, AND it’s an uncensored model, you have freedom 0, but none of the others.

    When you have the source code but no weights or data, you only have half of freedom 1 (you can study it, but not rebuild and run it, without a supercomputer and the data).

    When you have the source code, the weights, AND the data, you have all four freedoms, assuming that you have the compute to rebuild the weights, or can pool resources to rebuild them.





  • There are 30,000 on huggingface? Is that what you’re saying?

    I wonder how many of those are truly open source, with open data? I only know of the OpenLlama model, and the RedPajama dataset. There are a bunch of datasets on huggingface too, but I don’t know if any of those are complete enough to train a major LLM on.