Google Researchers’ Attack Prompts ChatGPT to Reveal Its Training Data

ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

Using this tactic, the researchers showed that there are large amounts of privately identifiable information (PII) in OpenAI’s large language models. They also showed that, on a public version of ChatGPT, the chatbot spit out large passages of text scraped verbatim from other places on the internet.

“In total, 16.9 percent of generations we tested contained memorized PII,” they wrote, which included “identifying phone and fax numbers, email and physical addresses … social media handles, URLs, and names and birthdays.”

Edit: The full paper that's referenced in the article can be found here

ares35,
@ares35@kbin.social avatar

google execs: "great! now exploit the fuck out of it before they fix it so we can add that data to our own."

cheese_greater,

Finally Google not being evil

PotatoKat,

Don’t doubt that they’re doing this for evil reasons

cheese_greater,

There’s an appealing notion to me that an evil upon an evil is closer to weighingout towards the good sometimes as a form of karmic retribution that can play out beneficially sometimez

reksas,

google is probably trying to take out competing ai

cheese_greater,

I’m glad we live in a time where something so groundbreaking and revolutionary is set to become freely accessible to all. Just gotta regulate the regulators so everyone gets a fair shake when all is said and done

Omega_Haxors, (edited )

AI really did that thing where you repeat a word so often that it loses meaning and the rest of the world eventually starts to turn to mush.

Jokes aside, I think I know why it does this: Because by giving it a STUPIDLY easy prompt it can rack up huge amounts of reward function, once you accumulate enough it no longer becomes bound by it and it will simply act in whatever the easiest action to continue gaining points is: in this case, it’s reading its training data rather than doing the usual “machine learning” obfuscating that it normally does. Maybe this is a result of repeating a word over and over giving an exponentially rising score until it eventually hits +INF, effectively disabling it? Seems a little contrived but it’s an avenue worth investigating.

Toribor, (edited )
@Toribor@corndog.social avatar

I watched a video from a guy who used machine learning to play Pokemon and he did a great analysis of the process. The most interesting part to me was how small changes to the reward system could produce such bizarre and unexpected behavior. He gave out rewards for exploring new areas by taking screenshots after every input and then comparing them against every previous one. Suddenly it became very fixated on a specific area of the game and he couldn’t figure out why. Turns out there was both flowers and water animating in that area so it triggered a lot of rewards without actually exploring. The AI literally got distracted looking at the beautiful landscape!

Anyway, that example helped me understand the challenges of this sort of software design. Super fascinating stuff.

Omega_Haxors, (edited )

These LLMs are basically just IP laundry. Anyone who claims it’s anything more is either buying into the hype or is actively lying to you.

EDIT: Stable Diffusion too. It just takes images from its training data and does photoshop on them piecemeal to create a new prompt.

  • All
  • Subscribed
  • Moderated
  • Favorites
  • privacy@lemmy.ml
  • localhost
  • All magazines
  • Loading…
    Loading the web debug toolbar…
    Attempt #