
How to Finetune GPT-Like Large Language Models on a Custom Dataset


Posted on Thursday, May 25th 2023 by T-A

https://lightning.ai/pages/blog/how-to-finetune-gpt-like-lar...

122 comments


@ Thursday, May 25th 2023 by artembugara

I have a question for the Generative AI experts here.

So, I can use something like GPT-4 to label data and then use that as a training set for my own LLM, right?

EDIT: adding this restriction from the OpenAI TOS: "(iii) use output from the Services to develop models that compete with OpenAI;"

 

@ Thursday, May 25th 2023 by montenegrohugo | parent

Yup, totally. This is a form of knowledge distillation. OpenAI, or other foundation model providers, can't really do anything about it.

 

@ Thursday, May 25th 2023 by cookieperson | parent

Well they can sue you and bankrupt you by delaying trial for a decade. That's how the US patent system works anyways...

 

@ Thursday, May 25th 2023 by sanxiyn | parent

Sue on what grounds? It will be quickly dismissed.

 

@ Friday, May 26th 2023 by mind-blight | parent

That is not how the US legal system works. You can sue someone for anything, regardless of merit, and they will have to defend themselves. That costs time and legal fees. If they lose, they can appeal, and continue appealing. If it's baseless, they'll lose, but you still spend a lot of money and time dealing with the lawsuit.

 

@ Friday, May 26th 2023 by cookieperson | parent

If by quickly you mean 5 to 10 years of paying a retainer to a lawyer, sure. Even if you win the case you lose in life. Most individuals can't afford 500k in legal fees to have them be reimbursed years later. Big companies have lawyers on staff at a discount and they play these games every day.

This happens with illegal things all the time. E.g. a manager sexually harasses someone on video or something, it's some CEO's nephew who did it, so they fire the person who got harassed. The person who got harassed now has to acquire legal counsel on top of paying relocation clawbacks etc. A few years go by and the person who was in the right is trying to hold down a job, a family, and the stress of the legal battle. The company offers to settle two years in for 50k and 99% of people take it, sometimes at a loss. Also, getting employed is a lot harder when a background check reveals suing a previous employer or really any company, because, shocker, most companies do illegal shit regularly... So it's almost always best to settle.

I realize I painted a picture pretty far from my previous statement but I figured you were new in your career and could benefit from an allegory of how stuff like this goes down.

 

@ Thursday, May 25th 2023 by wodenokoto | parent

It is my understanding that this is how "alignment" works.

That is, OpenAI paid people to chat with their LLM to fine-tune it, and then other LLMs use ChatGPT-generated data to align their models.

 

@ Thursday, May 25th 2023 by visarga | parent

There are three ways:

1. make your own RLHF dataset - like OpenAI and Open Assistant

2. exfiltrate data from a bigger/better LLM - Vicuna & family

3. use your pre-trained LLM to generate RLAIF data, no leeching - ConstitutionalAI, based on a set of rules instead of labelling examples

 

@ Thursday, May 25th 2023 by cubefox | parent

I wonder whether these approaches fit into the above categories:

https://arxiv.org/abs/2305.13735

https://arxiv.org/abs/2305.11206

 

@ Thursday, May 25th 2023 by snickmy | parent

Indeed, fine-tuning with either synthetic data (as you are proposing) or human review works like that. You can read more here: https://huggingface.co/blog/rlhf

 

@ Thursday, May 25th 2023 by fallingmeat | parent

That is against their ToS though if you use your new LLM commercially.

 

@ Thursday, May 25th 2023 by pmoriarty | parent

So what are they going to do about it?

 

@ Thursday, May 25th 2023 by jstummbillig | parent

That escalated quickly.

 

@ Thursday, May 25th 2023 by fallingmeat | parent

Great question! I don't know the end game there. Maybe if they suspected their model was used they would sue, and in discovery find you used their model for training?

 

@ Thursday, May 25th 2023 by visarga | parent

Maybe we don't need to worry: OpenLLaMA is training right now. It will be a commercially usable version of LLaMA.

>Update 05/22/2023

>We are happy to release our 700B token checkpoint for the OpenLLaMA 7B model and 600B token checkpoint for the 3B model. We've also updated the evaluation results. We expect the full 1T token training run to finish at the end of this week.

https://github.com/openlm-research/open_llama

So we could develop on LLaMA for now and switch to OpenLLaMA later.

 

@ Thursday, May 25th 2023 by bottled_poe | parent

Nothing until it's worth their while.

 

@ Thursday, May 25th 2023 by postsantum | parent

MS lawyers have a good track record at sending out those scary cease&desist letters

 

@ Thursday, May 25th 2023 by sanxiyn | parent

I don't think that works. LLM-generated contents are not copyrightable.

 

@ Thursday, May 25th 2023 by dragonwriter | parent

Breach of contract for violating the TOS agreed to when signing up for the service doesn't depend on copyright.

 

@ Thursday, May 25th 2023 by nightski | parent

Right but cease and desist usually relates to intellectual property or copyright matters, typically not TOS violations. Please correct me if I am mistaken.

 

@ Thursday, May 25th 2023 by dragonwriter | parent

Cease and desist can be used for any issues where the person or entity issuing the C&D thinks they have a legal right that is being violated and wants to put the violator on notice in the hopes of securing a change in behavior short of legal action.

 

@ Thursday, May 25th 2023 by aix1 | parent

What I don't understand - is there anything that would prevent Alice from publishing ChatGPT prompts and outputs for anyone to use, with no T&C attached?

Once Alice has done that, is there anything to prevent Bob, who has never agreed to the ChatGPT ToS, from using those prompts and outputs to train his own models to compete with OpenAI's?

(Purely from a contractual/legal/IP angle rather than ML/technical.)

 

@ Thursday, May 25th 2023 by pmoriarty | parent

Is a terms of service considered a contract?

 

@ Thursday, May 25th 2023 by sanxiyn | parent

They can terminate your account.

 

@ Thursday, May 25th 2023 by dragonwriter | parent

>So what are they going to do about it?

If they think they can prove you used it to develop a competing service, sue you for breaking the TOS and recover the greater of the harm it did to their business or the amount of your profits from the service that are due to the use of GPT-4 in violation of the agreement.

 

@ Thursday, May 25th 2023 by pmoriarty | parent

Have companies managed to get awarded damages in lawsuits against their customers who merely broke their terms of service?

Is there existing case law here?

 

@ Thursday, May 25th 2023 by artembugara | parent

As far as I remember, I fully own all the rights to the output of OpenAI (for example).

 

@ Thursday, May 25th 2023 by dingledork69 | parent

I wonder how they reconcile naming themselves "Open"AI with telling people that generated works can be used however they please, except for training a potential competitor.

 

@ Thursday, May 25th 2023 by vlovich123 | parent

And yet they trained theirs on commercial content on the internet. If that's legal I doubt their argument holds up in court right?

 

@ Thursday, May 25th 2023 by sanxiyn | parent

Of course it will hold up in court, it's their service and their terms of service.

 

@ Thursday, May 25th 2023 by dragonwriter | parent

They trained on publicly-available (no signup with TOS agreement) data, on the theory that training is fair use.

You signed up and agreed to their TOS to use GPT-4.

The legal situations are not similar.

OTOH, lots of people are openly using GPT-4 in one way or another to develop models, though they might generally be at arm's length from people intending to sell services.

 

@ Thursday, May 25th 2023 by flangola7 | parent

>They trained on publicly-available (no signup with TOS agreement) data, on the theory that training is fair use.

They openly state they used thousands of books from a pirate site as a training source. Go look up the datasets listed in the GPT-3 paper.

 

@ Thursday, May 25th 2023 by snovv_crash | parent

So set up a shell company that uses GPT4 to make public domain examples of what RLHF data would look like, and then the parent company takes that data afterwards since it's public domain. Shell company didn't break TOS.

 

@ Thursday, May 25th 2023 by ramesh1994 | parent

It prohibits anything that competes with OpenAI services, i.e. as long as you're not literally providing an LLM API commercially, you should be fine.

 

@ Thursday, May 25th 2023 by bagels | parent

Does it compete with them if you stop paying for their API?

 

@ Thursday, May 25th 2023 by foobarbecue | parent

Is "ca" "can" or "can't"?

 

@ Thursday, May 25th 2023 by artembugara | parent

can

 

@ Thursday, May 25th 2023 by notpublic | parent

not an AI expert but from a talk I recently heard... if there is a mismatch in training data between the "teacher" LLM and "student" LLM, you risk teaching the student to hallucinate or to ignore information

 

@ Thursday, May 25th 2023 by moffkalast | parent

>I can use smthg like GPT-4 to label data and then use that as a train set for my own LLM, right?

Yes, almost all improved LLaMA models are tuned exactly that way (trained on examples of questions and answers from, say, GPT-4). If OpenAI stole copyrighted works to train their models, it is morally fair game to do the same to them regardless of their TOS. It's not like they can prove it anyway.

Plus there's the other point where they also say that everything generated by their models is public domain, so which one is it eh?

 

@ Thursday, May 25th 2023 by sirsinsalot | parent

This ... but we all know business is corrupt.

OpenAI's current attempts to spur on regulation are moat building.

 

@ Friday, May 26th 2023 by ehnto | parent

We were complacent while it happened because OpenAI wasn't a business; it wasn't seen as unethical to use community work to contribute to community research. Now they're entrenched and have pulled the rug out from under the community, whilst also trying to shut the door on anyone else.

Just a really disappointing series of events; the money and profit were never the big issue.

 

@ Thursday, May 25th 2023 by jrm4 | parent

I'm a lawyer, so: one should never break the law.

Nonetheless, I can observe and predict that non-consensual "open sourcing" of these models would probably end up being the best and safest way to do all of this.

 

@ Thursday, May 25th 2023 by Fgehono | parent

Because by training it they created something new.

I don't mind; I'm just making a point.

But I don't think they mind either. I don't believe this type of model training can be bleeding edge, which should guarantee that OpenAI has enough motivation to continue development while still facing healthy competition.

 

@ Thursday, May 25th 2023 by sp332 | parent

It's against the terms of service to do the generation, but the generated text is not copyrighted. Those are different things.

 

@ Thursday, May 25th 2023 by cameldrv | parent

GPT-4 is trained on a large number of web pages, some of which will have had their own terms of service.

 

@ Thursday, May 25th 2023 by svaha1728 | parent

Not only web sites; full books from Scribd and other sources.

 

@ Thursday, May 25th 2023 by asah | parent

See LinkedIn v. hiQ (which hiQ won), covering fair use of logged-out web pages.

 

@ Thursday, May 25th 2023 by pvarangot | parent

I have to log in to OpenAI to generate conversations, but I can post those conversations on my own logged-out blog. That's the same thing OpenAI would probably say if they got sued because GPT spits out copyrighted content it found on a logged-out webpage. They can't reasonably expect people not to use them for training.

 

@ Friday, May 26th 2023 by ehnto | parent

Is it legal for one of their computer systems to access mine without my consent, even if publicly routable via the internet?

If I found an open port on a government computer, it would still be illegal for me to access it, wouldn't it? Has the difference (that this is port 80/443 and happens to serve HTTP requests) been addressed in law or by a court?

 

@ Friday, May 26th 2023 by winddude | parent

Show me the ToS where it says that, and I still won't care, because it would absolutely be legal under the same principle OpenAI is using for the training data as a transformative work.

FYI: here are the relevant parts from the TOS:

(iii) use output from the Services to develop models that compete with OpenAI; (iv) except as permitted through the API

Sounds like you are allowed to as long as it's through the API, as this "imaginary" restriction isn't in https://openai.com/policies/api-data-usage-policies, or https://openai.com/policies/usage-policies.

 

@ Thursday, May 25th 2023 by fnordpiglet | parent

Use of copyrighted material in such a way that it's aggregated into statistical properties is almost certainly fair use. Use of the model to produce reproductions of copyrighted material then consuming or distributing it is almost certainly violating the copyright. But it was the facsimile of the material that's the violation, not the abstract use of it to generate an aggregate model.

 

@ Thursday, May 25th 2023 by tsunamifury | parent

You understand these things have a very, very wide scope of interpretation that has yet to be tested in court. I wouldn't make these statements so confidently, as courts tend to reinterpret the law significantly to balance societal factors when serious technological changes occur.

 

@ Thursday, May 25th 2023 by itake | parent

AI generated work is not copyright-able. I guess the courts later could disagree though.

https://www.copyright.gov/ai/

 

@ Thursday, May 25th 2023 by belter | parent

What if the AI generates a new Eric Clapton album, with a similar voice and guitar-playing style?

 

@ Thursday, May 25th 2023 by itake | parent

Your example doesn't have to be AI-generated. Human cover bands play song X in the style of Y all the time.

 

@ Friday, May 26th 2023 by anticensor | parent

They are in the UK:

https://www.gov.uk/government/consultations/artificial-intel...

 

@ Thursday, May 25th 2023 by fnordpiglet | parent

This is true - afaik there have been no specific rulings on whether training models on copyrighted material is a violation. But to my mind it harkens back to stuff like Xerox, where the tool itself isn't the violating thing; it's the use of the tool. Likewise, derivative works are often largely reproductions with minor variations and are protected under fair use. A model that takes enormous amounts of data, distills it into a tiny vector representation way below the information-theoretic level needed for any meaningful fidelity, and mixes and overlaps data such that the original data isn't plausibly stored in the model... I'm definitely not going to wager my life that's fair use, but I would wager my company on it.

 

@ Thursday, May 25th 2023 by tsunamifury | parent

In the history of media law I've seen judges lean into whatever interpretation balances the ecosystem more than what is "literally the law". The law is meant to serve people, not the other way around. I hope judges will understand that the contribution and theft can't just be "haha fuck humanity, love OpenAI".

 

@ Thursday, May 25th 2023 by fnordpiglet | parent

Ok, what about the open source and research models? I wouldn't wager much on OpenAI keeping a lead indefinitely. Certainly not enough to establish case law on what's a pretty new technology (at least in its current use).

 

@ Thursday, May 25th 2023 by jjoonathan | parent

Yes, laws are about politics and dispute resolution more than reasoning or correctness. Focusing on the pure logic is a trap for the computationally inclined.

 

@ Friday, May 26th 2023 by nl | parent

I want to train my own LLM on public but copyrighted data. I think this is serving humanity (and fucking OpenAI). I also think it is ethical because there's a big difference between "learning from" and "copying".

Your proposed reading of the law means only big tech will be able to afford the license fees to train on large amounts of data.

 

@ Saturday, May 27th 2023 by tsunamifury | parent

How do YOU plan on compensating those whose labor helped you? I bet you don't. It's the same thing; you're just imagining that being David rather than Goliath makes it OK for you.

 

@ Saturday, May 27th 2023 by mrtranscendence | parent

It's not always necessary to compensate those whose labor helped you. I haven't compensated many of the open source projects I use, for example, even those who clearly want me to (with nagging pop-ups). If the use of copyrightable material to train a model is legal, and it does not legally require compensation, it might be difficult to argue that the use of such material should be compensated or else. It would depend IMO on whether there are norms in place for this kind of thing, and I don't necessarily see wide agreement.

 

@ Thursday, May 25th 2023 by chaxor | parent

Yes, and in fact that's the best method available if you want good performance. I would suggest using a local open-source model to do this, however, to cut down on costs and make it far simpler to deal with than the unwieldy OpenAI systems.

https://arxiv.org/pdf/2305.02301.pdf
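
For a concrete picture of that labeling loop, here is a minimal sketch using a local open-source model through Hugging Face's transformers pipeline. The model name, prompt template, and sentiment task are illustrative assumptions, not taken from the paper above.

    # Sketch: label raw text with a local "teacher" model, then save the pairs
    # as a fine-tuning set. Model name and prompt format are assumptions.
    from transformers import pipeline

    teacher = pipeline("text-generation", model="openlm-research/open_llama_3b")

    unlabeled = [
        "The battery died after two days.",
        "Setup took five minutes and everything just worked.",
    ]

    labeled = []
    for text in unlabeled:
        prompt = (
            "Classify the sentiment of this review as positive or negative.\n"
            f"Review: {text}\nSentiment:"
        )
        out = teacher(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
        labeled.append({"input": text, "label": out[len(prompt):].strip()})

    # `labeled` can now be written to JSONL and used as a training set.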


@ Thursday, May 25th 2023 by Obscurity4340

This looks like the Orion browser logo.


@ Thursday, May 25th 2023 by quickthrower2

When is fine tuning worth it, rather than just prompt engineering?

 

@ Thursday, May 25th 2023 by messe | parent

When you're starting to run into context limits.

 

@ Thursday, May 25th 2023 by tstrimple | parent

From what I've seen, it's when embeddings get too large for the token limit or the embeddings drive the cost up too much because you're always operating near the max token limit. In those cases, it may be worth the up front training cost and slightly higher per-token cost to dramatically reduce the amount of tokens in the average request. If you're building a higher throughput solution, the difference in cost can be quite large.

 

@ Thursday, May 25th 2023 by snovv_crash | parent

If you want to teach it eg. all of the text in your private training manuals and internal documentation, which wouldn't fit in the input token size.

 

@ Thursday, May 25th 2023 by heliophobicdude | parent

I think these are two very separate concepts.

What we are mostly seeing when it comes to fine-tuning is making a model promptable. Models like LLaMA or the original GPT-3 weren't promptable. They were fine-tuned with demonstration data that pairs a prompt input with a prompt output.

See below:
{
  "instruction": "What would be the output of the following JavaScript snippet?",
  "input": "let area = 6 * 5;\nlet radius = area / 3.14;",
  "output": "The output of the JavaScript snippet is the radius, which is 1.91."
} [1]
Prompt engineering is really just carefully designing what inputs and outputs on a prompt-ready model work best.
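
For illustration, here is roughly how a record like the one above gets flattened into a single training string during instruction tuning; the Alpaca-style template below is an assumption about the general pattern, not the exact one used by codealpaca.

    # Rough sketch: turn an instruction/input/output record into one training string.
    def format_example(rec: dict) -> str:
        prompt = (
            "Below is an instruction that describes a task, paired with an input.\n\n"
            f"### Instruction:\n{rec['instruction']}\n\n"
            f"### Input:\n{rec['input']}\n\n"
            "### Response:\n"
        )
        return prompt + rec["output"]

    record = {
        "instruction": "What would be the output of the following JavaScript snippet?",
        "input": "let area = 6 * 5;\nlet radius = area / 3.14;",
        "output": "The output of the JavaScript snippet is the radius, which is 1.91.",
    }
    print(format_example(record))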

I highly recommend skimming this RLHF article and looking for the parts where it talks about demonstration data [2]

1: https://github.com/sahil280114/codealpaca/blob/master/data/c...

2: https://huyenchip.com/2023/05/02/rlhf.html

 

@ Thursday, May 25th 2023 by quickthrower2 | parent

Thanks for link 2 - it is worth a proper read! Read half of it already and it is very interesting and useful for understanding this.

 

@ Thursday, May 25th 2023 by heliophobicdude | parent

Cheers!

 

@ Thursday, May 25th 2023 by baobabKoodaa | parent

Prompt engineering and fine tuning are in many cases alternative ways to achieve the same goal. You claim that the "original GPT3" wasn't promptable. I'm unsure which version you refer to, but I'm guessing you refer to text-davinci-003 and it was definitely promptable. For one app I used prompt engineering to make it behave like a spirit talking through a ouija board. For another, I used prompt engineering to make it act like a dystopian search engine from the future. So, yeah, it's promptable.

 

@ Thursday, May 25th 2023 by oddthink | parent

It's worth it whenever you have a reasonable amount of training data. You can get substantial quality improvements automatically. Unless you're doing some kind of prompt-optimization, prompt-tuning is a lot of random guessing and trial-and-error. It's also most necessary when you have a smaller base model, as opposed to one of the big ones.


@ Thursday, May 25th 2023 by swalsh

How does this compare to fine tuning something like BERT?

 

@ Thursday, May 25th 2023 by theaniketmaurya | parent

I would say it's similar, since the building block for both is the transformer. In this blog post, the fine-tuning strategy used is Adapter, which basically adds a learnable layer to the Transformer block.
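
As a rough sketch of the idea (dimensions and placement are illustrative, not the article's exact configuration), an adapter is a small bottleneck module added inside each transformer block, and only these new parameters are trained:

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Small bottleneck MLP inserted into a transformer block."""
        def __init__(self, d_model: int = 768, bottleneck: int = 64):
            super().__init__()
            self.down = nn.Linear(d_model, bottleneck)
            self.up = nn.Linear(bottleneck, d_model)
            self.act = nn.GELU()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Residual: frozen transformer output plus a small learned update.
            return x + self.up(self.act(self.down(x)))

    # During fine-tuning, the pretrained transformer weights stay frozen and
    # only the adapter parameters (and usually layer norms) are updated.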


@ Thursday, May 25th 2023 by slenocchio

Can someone explain why I'd want to use fine-tuning instead of a vector database (or some other way of storing data/context)?

 

@ Thursday, May 25th 2023 by mgfist | parent

First reason that comes to mind is you can make much smaller models, which helps with latency, cost and may enable you to run the model locally.

 

@ Thursday, May 25th 2023 by pid-1 | parent

I've been playing with storing documents as OpenAI embeddings for the past few weeks and, at least for my use case, the results are meh. It seems sometimes just using context is not enough.

My next step is to play with fine tuning, but I have no results to report yet.

 

@ Thursday, May 25th 2023 by santiagobasulto | parent

I'd be very interested in knowing the outcome. Do you blog anywhere (or post on social)?

 

@ Thursday, May 25th 2023 by deforciant | parent

Have you tried other models to generate embeddings? I am going in that direction too, to create an additional layer of helpers for search.
Also, I'm thinking that if the document is not too big, it might fit into the initial context along with the prompt.

 

@ Thursday, May 25th 2023 by akiselev | parent

Try using InstructXL for embeddings. It's got a more complex prompt structure for generating embeddings, which might be more useful.

 

@ Friday, May 26th 2023 by potatoman22 | parent

If the documents are large, try embedding smaller portions. If there's a heavy domain vocabulary, you might need a custom model.
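
A minimal sketch of that chunking step (the chunk size and overlap below are arbitrary assumptions; tune them for your documents):

    def chunk_text(text: str, chunk_chars: int = 1000, overlap: int = 200) -> list[str]:
        """Split a long document into overlapping character windows before embedding."""
        chunks, start = [], 0
        while start < len(text):
            chunks.append(text[start:start + chunk_chars])
            start += chunk_chars - overlap
        return chunks

    # Each chunk is embedded separately, so retrieval can return just the
    # relevant passage instead of the whole document.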

 

@ Thursday, May 25th 2023 by mountainriver | parent

I think it probably works a lot better, but I would love to see some research validating this

 

@ Thursday, May 25th 2023 by chadash | parent

I've read in a few places that it actually works worse in most cases. Much better to put the context in your prompt.

 

@ Thursday, May 25th 2023 by CuriouslyC | parent

Fine tuning + context will outperform context alone, and it's cheaper to burn cycles fine tuning then use a smaller context than to use a larger context in production.

 

@ Thursday, May 25th 2023 by Guillaume86 | parent

Fine tuning + same context will probably outperform context alone, but if you use a smaller context that does not seem to work that well as GP stated.

 

@ Thursday, May 25th 2023 by oddthink | parent

Wouldn't a vector database just get you nearest-neighbors on the embeddings? How would that answer a generative or extractive question? I can see it might get you sentiment, but would it help with "tell me all the places that are mentioned in this review"?

 

@ Thursday, May 25th 2023 by superchink | parent

I think the point is that you use the vector database to locate the relevant context to pass to the LLM for question answering. Here's an end-to-end example:

https://www.dbdemos.ai/demo.html?demoName=llm-dolly-chatbot
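
In outline, that retrieve-then-prompt flow looks something like the sketch below; `embed`, `vector_store.search`, and `llm` are stand-ins for whatever embedding model, vector database, and LLM you actually use, not a specific API.

    def ask(question: str, vector_store, embed, llm, top_k: int = 3) -> str:
        """Retrieve the closest chunks, then hand them to the LLM as context."""
        query_vec = embed(question)
        chunks = vector_store.search(query_vec, top_k=top_k)  # nearest-neighbor lookup
        context = "\n\n".join(chunks)
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:"
        )
        return llm(prompt)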

 

@ Friday, May 26th 2023 by montag | parent

Right. You feed the text chunks (from the matched embeddings) to a generative LLM to do the extractive/summarization part.

 

@ Thursday, May 25th 2023 by heliophobicdude | parent

Assuming you would want to fine-tune over a codebase or set of documents, I would argue vector databases and fine-tuning are completely different tools.

I would strongly recommend against fine-tuning over a set of documents, as this is a very lossy information retrieval system. LLMs are not well suited for information retrieval the way databases and search engines are.

The applications of fine-tuning where we are seeing a lot of success are in making completion models like LLaMA or the original GPT-3 prompt-able. In essence, prompt-tuning or instruction-tuning. That is, giving the model the ability to respond in a user-prompt / LLM-output chat interface.

Vector databases, for now, are a great way to store mappings of embeddings of documents with the documents themselves for relevant-document information retrieval.

I would highly recommend skimming this RLHF article for how demonstration data was used to make a model prompt-able [1]. Keep in mind RLHF is another concept altogether, and we might be seeing a revolution where it becomes optional (thanks to LIMA)!

1: https://huyenchip.com/2023/05/02/rlhf.html

 

@ Friday, May 26th 2023 by a_bonobo | parent

Great reply, here's an example from my own work:

I want the user to be able to ask technical questions about a set of documents, then the user should retrieve a summary-answer from those documents along with a source.

I first need to finetune GPT4 so it better understands the niche-specific technical questions, the words used, etc. I could ask the finetuned model questions, but it won't really know from where it got the information. Without finetuning the summarised answer will suffer, or it will pull out the wrong papers.

Then I need to use a vector database to store the technical papers for the model to access; now I can ask questions, get a decent answer, and will have access to the sources.

 

@ Friday, May 26th 2023 by heliophobicdude | parent

Ah! That makes sense! That's a neat strategy!

 

@ Friday, May 26th 2023 by forgingahead | parent

Thanks (to both you and the parent) for sharing these details. So is it fair to say the following:

1. Fine-tuning bakes the knowledge into the model, but getting the "source" of an answer to a specific question becomes murky, and it is unclear whether the answer is accurate or just a hallucination.

2. Therefore vector databases, which can provide context to the LLM before it answers, can solve this "citation" problem, BUT:

3. We then have limits because of the context window of the LLM to begin with.

Is that a fair understanding, or have I totally gotten this incorrect?

Edit: Or, are you saying that you both fine-tune AND also use a vector database which stores the embeddings of the dataset used to fine-tune the model?

 

@ Thursday, May 25th 2023 by swalsh | parent

Fine Tuning = Output

Embeddings = Input

Fine-tuning is like a chef modifying a general pizza recipe to perfect a specific pizza, such as Neapolitan. This customization optimizes the result. In AI, fine-tuning adjusts a pre-existing model to perform better on a specific task.

Embeddings are like categorizing ingredients based on properties. They represent inputs so that similar inputs have similar representations. For instance, 'dog' and 'puppy' in an AI model have similar meanings. Like ingredients in a pizza, embeddings help the model understand and interpret the inputs. So, fine-tuning is about improving the model's performance, while embeddings help the model comprehend its inputs.

It turns out you can search a vector space of embeddings to find similar embeddings. If I turned my above post into two embeddings and you searched for "golden retriever", though neither paragraph has that exact phrase, the model should know a golden retriever is most similar to the second paragraph, which compares puppy to dog.
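
A toy sketch of that similarity search, with `embed` standing in for any embedding model:

    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def most_similar(query: str, paragraphs: list[str], embed) -> str:
        """Return the paragraph whose embedding is closest to the query's."""
        q = embed(query)
        scores = [cosine(q, embed(p)) for p in paragraphs]
        return paragraphs[int(np.argmax(scores))]

    # A query like "golden retriever" should score highest against the
    # paragraph about dogs and puppies, even with no exact phrase match.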

 

@ Thursday, May 25th 2023 by SparkyMcUnicorn | parent

I like to think of an LLM as a literal human. Not sure if it's the best analogy.

Fine tuning = Adding years of experience, in a set environment. E.g. raise them in a home that only speaks Old English, teach them Pig Latin, send them to a bootcamp.

Embedding = Giving them a book to reference information.

Just like a human, memory might fade a bit through the years but old habits die hard. You might not perfectly recollect what you learned years ago, but you still get the general idea, and if you took a class on the referenced book you'll be better at relaying information from it.

Edit: Asked ChatGPT to create the analogy.

A language model is like an intelligent person.

- Pre-training is their broad education and general knowledge.

- Fine-tuning is their years of specialized experience in a specific field.

- Embedding is like giving them a comprehensive book on a particular subject.

Just as a person gains knowledge, expertise, and specialized resources, the language model develops its understanding and performance through pre-training, fine-tuning, and embedding.

 

@ Thursday, May 25th 2023 by morgango | parent

I asked ChatGPT this question, and asked it to simplify as much as possible.

Fine-tuned Models: Imagine you have a super-smart robot that can talk about anything. But you want it to be really good at talking about, say, dinosaurs. So, you teach it more about dinosaurs specifically. That's what fine-tuning is - you're teaching the robot (or model) to be really good at a specific topic.

Vector Databases and Embeddings with LLM: This might be a little tricky, but let's think of it this way. Imagine you have a huge library of books and you want to find information on a specific topic, say, ancient Egypt. Now, instead of reading every book, you have a magical index that can tell you which books talk about ancient Egypt. This index is created by magically converting each book into a "summary dot" (that's the embedding). When you ask about ancient Egypt, your question is also converted into a "summary dot". Then, the magical index finds the books (or "summary dots") that are most similar to your question. That's how the vector database and embeddings work.

So, if you want your super-smart robot to be really good at one specific topic, you use fine-tuning. But if you want it to quickly find information from a huge library of knowledge, you use vector databases and embeddings. Sometimes, you might even use both for different parts of the same task!

 

@ Thursday, May 25th 2023 by anon373839 | parent

Fine-tuning could be useful to get a high text completion quality out of a small model within a specific domain. You would still use the resulting model alongside an info retrieval system to prompt with real context (unless you have a use case where hallucination is a feature).


@ Thursday, May 25th 2023 by mercurialsolo

While the fine-tuning pipeline is fairly straightforward for tuning and building custom models, the RLHF pipeline doesn't look to be as straightforward. Creating a dataset for RLHF seems like a fairly labour-intensive exercise, especially if your model is tuned to do work like code generation.

What about the Replit Ghostwriter? Did it have an RLHF phase?


@ Thursday, May 25th 2023 by nico

What is the main difference between training and fine tuning?

Can you start with a model trained only in producing the letter a, and then fine tune it to learn b, then c, then words, sentences, etc?

 

@ Thursday, May 25th 2023 by worldsayshi | parent

Yeah, since fine tuning seems to be so much cheaper than training, why hasn't OpenAI fine-tuned ChatGPT on data past 2021?

 

@ Thursday, May 25th 2023 by ajb117 | parent

My guess is that it's because they've already done RLHF on top of the standard next token prediction. In other words, they can't cheaply fine tune ChatGPT without undoing the RLHF objective by training on next token prediction with post-2021 data, and then retraining with RLHF to make sure it still gives good human-like output.

I mention the "undoing RLHF" since it's not uncommon for fine-tuned models to increase in error in the original training objective after being fine-tuned with a different one. I think people saw this happen in BERT.

Also ChatGPT is almost certainly huge.

 

@ Thursday, May 25th 2023 by heliophobicdude | parent

One argument is that it can contaminate the training data with output from the model itself or from other models.

We have already documented evidence of the effect of this. In the GPT-4 technical report [1], they reported contamination of humaneval data in the training data.

They did measure against a "non-contaminated" training set but no idea if that can still be trusted.

Why would this matter? We could have seemingly strong results on contaminated benchmarks but measure poorly against new and quarantined information. Classic overfitting.

Another argument is that data being put out there could very much be wrong, and the amount of it amplified by other models. Take a look at this sample of demonstration data for codealpaca [2]. Not only is its output wrong, but bad practices, like making up a random computation without having access to a place to run a calculation, teach the model that these types of responses are OK.

{ "instruction": "What would be the output of the following JavaScript snippet?", "input": "let area = 6 * 5;\nlet radius = area / 3.14;", "output": "The output of the JavaScript snippet is the radius, which is 1.91." }

1: https://cdn.openai.com/papers/gpt-4.pdf
2: https://github.com/sahil280114/codealpaca/commit/0d265112c70...
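
For reference, the snippet's actual arithmetic, so the scale of the error is clear:

    area = 6 * 5          # 30
    radius = area / 3.14  # ~9.55, not the 1.91 claimed in the demonstration data
    print(radius)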

 

@ Thursday, May 25th 2023 by londons_explore | parent

Ideally you train a model right to begin with, and no fine tuning is necessary.

However, sometimes you can't do that. For example, perhaps you want your model to always talk like a pirate, but you don't have billions of words spoken like a pirate to train on.

So the next best thing is to train a model on all english text (which you have lots of), and then finetune on your smaller dataset of pirate speech.

Finetuning is simply more training, but with a different dataset and often a different learning rate.

Typically, finetuning uses far far far less data and compute, and can be done by individuals with a home PC, whereas training a large language model from scratch is in the $1M - $1B range.
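
A minimal sketch of that "finetuning is just more training" idea; the model, data loader, and hyperparameters here are placeholders, not the article's recipe:

    import torch
    import torch.nn.functional as F

    def finetune(model: torch.nn.Module, loader, lr: float = 1e-5, epochs: int = 1):
        """Continue training a pretrained causal LM on a small domain dataset."""
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # lower LR than pretraining
        model.train()
        for _ in range(epochs):
            for input_ids, labels in loader:       # loader yields (input_ids, labels) batches
                logits = model(input_ids)          # (batch, seq, vocab)
                loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()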

 

@ Thursday, May 25th 2023 by swalsh | parent

Not an expert, but my high-level understanding is this: if a model is a set of inputs, some middle layers, and a set of outputs, fine tuning concentrates on only the output layers.

Useful for taking a generic model with a base level of knowledge, and tuning it so the output is more useful for an application specific use case.

 

@ Thursday, May 25th 2023 by ajb117 | parent

I think that's more in line with transfer learning, a variant of fine-tuning. If I'm reading this article correctly, they're fine-tuning the LMs end-to-end.

 

@ Friday, May 26th 2023 by RockyMcNuts | parent

not strictly true I think

- you could add new units throughout and train those while freezing existing units (adapter-based fine-tuning)

- you could train all units and use e.g. low-rank adaptation to limit how much they can change

- you could do prefix tuning and train an input to add at every layer

see e.g. - https://lightning.ai/pages/community/article/understanding-l...
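
For the low-rank adaptation case, a minimal sketch (the rank and scaling below are illustrative): the pretrained weight is frozen and only two small matrices are trained, so the effective weight becomes W plus a scaled BA.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Wrap a frozen nn.Linear and learn a low-rank update on top of it."""
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)            # freeze the pretrained weights
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
            self.scale = alpha / rank

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)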

 

@ Friday, May 26th 2023 by Taek | parent

For "full fine tuning", mathematically there's no difference. Fine tuning is just extending the training on new data.

What you are suggesting is called "curriculum learning", and though it hasn't been applied to LLMs yet to the best of my knowledge, it has proven to improve learning and decrease training times in other areas of ML.


@ Thursday, May 25th 2023 by hospitalJail

Has anyone tried to use this?

The guide obviously didn't produce usable code, and the GitHub repo looks nearly unrelated.

I'm somewhat surprised there isn't a parameter for 'input_data' and 'output_data' that returns a trained model. I can't figure out why there is so much boilerplate when that stuff could be contained in parameters.

 

@ Friday, May 26th 2023 by lyapunova | parent

I got stuck.

Try these: https://huggingface.co/blog/stackllama, https://huggingface.co/blog/trl-peft, https://huggingface.co/blog/hf-bitsandbytes-integration


@ Thursday, May 25th 2023 by akrymski

These NanoGPT-based models are great, thank you for contributing to open source. Would love to see this ported to CPUs a la llama.cpp. Any plans in that direction?


@ Thursday, May 25th 2023 by nomagicbullet

Is there a Dreambooth equivalent for fine-tuning ChatGPT, as there is for Stable Diffusion? I have to imagine that if we can add custom data to a DL text-to-image model, we should be able to do the same with a text-to-text one.

Edit to add: There are a number of Google Colabs for fine-tuning SD, and I wonder whether there are ways (or whether it is technically feasible) to accomplish the same with other txt2txt models.

 

@ Thursday, May 25th 2023 by SparkyMcUnicorn | parent

These aren't for ChatGPT, but work on LLaMA, Vicuna, etc.

https://github.com/oobabooga/text-generation-webui/blob/main...

https://github.com/zetavg/LLaMA-LoRA-Tuner

https://github.com/h2oai/h2o-llmstudio

https://github.com/rhulha/lora

 

@ Thursday, May 25th 2023 by a5huynh | parent

If you're running the text-generation-webui (https://github.com/oobabooga/text-generation-webui) it has the ability to train LoRAs.

It'll require a beefy GPU but I've seen some fun examples like someone training a LoRA on Skyrim books.


@ Thursday, May 25th 2023 by stoptrlling

Anyone know the computational cost of training with these LoRA designs? Given that we are talking about rates of tokens per second, it seems training on a bigger dataset could be extremely expensive.

 

@ Thursday, May 25th 2023 by t-vi | parent

The adapter and LoRA have drastically fewer parameters, so one might expect that forward + backward is roughly 2x the cost of forward.

Then (as far as I know), in contrast to generation, training is done on the entire output of the transformer (so all tokens of the full input) rather than serially token-by-token (in the RNN days, this was called teacher-forcing), so that may give you a significant boost in the tokens per second rate over generation.


@ Thursday, May 25th 2023 by jpe90

Would it be feasible to fine-tune a large, capable model (like the recent LIMA) on the source code (and maybe a few high quality libraries) of a niche language, such that it's much better at helping you write and understand it?

Imagine how many doors it would open if you could fine-tune models capable of writing language bindings for you and keeping them up to date.

 

@ Thursday, May 25th 2023 by tazjin | parent

Totally. GPT-4 can already do this, untuned, on niche languages and libraries. One of the main problems is still that you don't know when it's hallucinating a function or whatever though.


@ Thursday, May 25th 2023 by sandGorgon

Has anyone here used EasyLM? It seems to be the most used for the best finetuned models out there.

 

@ Friday, May 26th 2023 by winstonprivacy | parent

Sounds interesting. Curious if there is a tutorial for this.


@ Thursday, May 25th 2023 by zhwu

It seems training Vicuna on a custom dataset could be quite easy as well, according to the following:
https://github.com/skypilot-org/skypilot/tree/master/llm/vic...


@ Friday, May 26th 2023 by lyapunova

I have been working in this space for quite a while, and while I think PyTorch Lightning meant well in the beginning, it seems modern use cases have outgrown it.

These days, when I see content from Lightning AI, I prepare for a contrived approach to doing something that fits within their ecosystem. I can't help but feel they are trying to induce "vendor lock-in" where there really isn't a business case for it...

Anyway, I tried to follow these steps and hit a dead end. I have to say the content put out by Hugging Face is always way more straightforward and gets me where I need to be when I want to spin up quickly.

