Step one for my attempt to create image descriptions with CLIP. Looks like my helper model can generate some legit image descriptions.

Looks like "A giraffe eating a banana" got eaten by a lion whilst trying to descend through the latent jungle.

Ok, maybe I should start with something more common?
Trying to find "A lonely person sitting in a park checking their mobile phone."

An almost perfect descent, in a more reader-friendly version:

And here we are - my first attempt at evolving image descriptions with OpenAI's CLIP model by simulated annealing.

"Portrait of businesswoman talking in a library office discussing documents meeting"


"Progressive electronic artist performing live"
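
For readers wondering what "evolving descriptions by simulated annealing" can look like, here is a minimal sketch. It is not the code behind these captions. It assumes a caption is a fixed-length list of words, that one word is swapped per step, and that a scoring function rates how well the caption fits the image. The names `anneal_caption`, `toy_score`, `TARGET` and `VOCAB` are all made up for illustration, and `toy_score` is only a toy stand-in for a real CLIP image-text similarity.

```python
import math
import random


def anneal_caption(words, vocab, score, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Try to maximize score(words) by swapping one word at a time."""
    rng = random.Random(seed)
    current = list(words)
    current_score = score(current)
    best, best_score = list(current), current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        # Propose a neighbor: replace one random position with a random vocab word.
        candidate = list(current)
        candidate[rng.randrange(len(candidate))] = rng.choice(vocab)
        cand_score = score(candidate)
        delta = cand_score - current_score
        # Always accept improvements; accept worse captions with probability exp(delta / t).
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = list(current), current_score
    return " ".join(best), best_score


# Toy stand-in for CLIP (assumption): count words that match a hidden target caption.
TARGET = "a lonely person sitting in a park".split()


def toy_score(caption_words):
    return sum(1.0 for w, t in zip(caption_words, TARGET) if w == t)


VOCAB = sorted(set(TARGET) | {"giraffe", "banana", "lion", "jungle", "beach"})
start = ["giraffe"] * len(TARGET)
caption, value = anneal_caption(start, VOCAB, toy_score)
print(caption, value)
```

In a real setup the scoring function would presumably be the similarity between CLIP's text and image embeddings, and the word proposals could come from a helper caption model rather than a uniform random pick.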


Cherry-picked, but legit:

"Happy family running on a beach."

"Tourist attraction is located at the end of tour."

"As we celebrate the end of day, we've always been married to see world's biggest fireworks."

Here's what the AI thinks this is:

"group of designers dressed for dancing in the spring"

"the cheerleaders perform on stage as they are playing in mud"

Who needs boring captions anyway?

"psychedelic rock artist and dancer pose for a portrait backstage"
"the new year's eve party dress"
"person is the only thing about eyes of face"
"the first thing to see world's biggest eyes"
"the eyes of eye is a must - see if you're looking"

"hairdresser in the laboratory"
"rear view of a young woman with headphones listening music"
"a woman is choosing sewing machine on her head"
"the sculpture of brain is seen"

One could almost come to believe that CLIP can read 😮

"seafood on sale"
"biggest fish for sale property"
"sale on in an aquarium"
"the sign that is located on banks of river."

"vector silhouette of a man who practices puzzle"
"karate isolated silhouette of a wrench vector"
"as a symbol, person and is shown here"
"grunge futuristic logo symbol"
"realistic number rotating symbol in the form of fluctuating square with shattered angles"

"a great idea for couple who love to have their own hair"
"hairstyles: person will be able to spend time with their lives of the world's smallest wedding"
"hairstyles for hair: actor and person are seen in a scene from the film"

I enriched my captioning model a bit by adding my famous quotes data set.

"A man reads book from his wife."
"The best thing about marriage is to be a good loser."
"The most important thing about marriage is to be a good salesman."
"A man who reads his wife is a poet."

"A mechanic repairing customer."
"Rear view of a senior man standing at washing machine."
"Blacksmith forging a liquid nitrogen in laboratory."
"A firefighter stands in front of fire burning stove."

Besides "science fiction tv program" variations it did:

"sailors aboard"
"military commander speaks during a press conference aboard ship"
"pals were later joined, wearing a red t-shirt"
"diplomatic moral force to solve a problem"
"wisdom duty leads inexorably toward conformity"

"A man carrying heavy load of snow on his head"
"A group of people in traditional costume"
"Residents enjoy the cold weather"
"Construction workers construct photovoltaic panels on the site of a nuclear power plant "

I trained another caption model that is only allowed to write like Shakespeare - let's not look at the grammar yet.

"Elsinore: the soles that loss of time"
"Miranda: even now, methinks i were at my heels"
"Have you no more but my shoes than they can"

"But to be paddling, and in my regard"
"Senator in arms down, ourselves away our salt-water shapes weapons"
"Than his sea-monster abides death. amen bond exit cressida! "
"Consult bolingbroke's deeps upon sightless coast."
"I am a plain-dealing in the ocean."

@Quasimondo this is going extremely as expected 😂 i love it. glad someone’s actually trying this on here
