TL;DR: text-to-image AI platforms are magical, but many technical, practical and ethical issues are brewing underneath this fantastical world.
As news of text-to-image AI use cases flooded the internet, I became inspired to create with this technology. I wrote a short story about Odogwu, a travelling demon seeking protection for their world. The central motif in the story is the struggle to maintain stability in one’s world. Odogwu lives in a post-apocalyptic world and must find a sacred spot holding the protective powers they need. The story follows them as they cross a magical forest full of dangerous creatures, and readers are left wondering whether, and how, they make it through unscathed.
I knew friends who were actively using platforms like NightCafe, DALL·E, and Midjourney in their creative processes, and I, too, wanted to jump on this bandwagon. You might have seen the images and videos of what Michael Jackson would look like if he were still alive, or pictures of other dead celebrities making their rounds on social media. After weeks of tinkering on various text-to-image AI platforms, I decided to revive a project from 2020. The project I chose to revisit stems from a dream, and I was convinced that the contents of the dream were an important message to help guide me through a difficult time. In the dream, I had an encounter with my grandmother, who gifted me the words, “Your [work/walk] is your ground.” I had heard both “work” and “walk” simultaneously, and in the way that dreams work, I found meaning in both senses of the sentence. I treasured the dream because I was going through relationship problems, friendship breakups, the death of a family member, COVID-19 isolation, and financial hardship. I was moved to make some artwork and started, but later abandoned, a video essay featuring a protagonist tackling some of the issues that many other people and I faced during the COVID-19 pandemic. Since it can be hard to find funding for art, the project morphed into a film showing the migrant community in Berlin navigating the pandemic. Two years later, with the help of text-to-image AI, I returned to the dream with more liberty. I wrote a fictional character who must find a way to live and thrive in a post-apocalyptic world. Thanks to text-to-image AI, I created fantasy images without hiring an illustrator. I used Midjourney to generate a sleep-paralysis demon searching for protective powers in an ugly world.
I have aphantasia and did not have a mental picture of Odogwu; I identified more with their quest and emotions. I entered my prompt for their character, “sleep paralysis demon with shining emerald third eye dystopian dark style”, and instantly fell in love with the result. As the resulting image rendered in the Discord server, the blur in my head gained definition, colour, texture, light and depth. I read somewhere that any sufficiently advanced technology is indistinguishable from magic, and magic is what I felt.
Generating an image on a text-to-image AI platform is easy for the user: the AI interprets the text and creates one or more images that its machine learning models find representative. These models, accessible through web browsers, desktop and mobile apps, have been trained on large image datasets with corresponding text descriptions. The compositional structure of the prompt is important because the success rate depends on how the prompt is phrased. The best models understand how everyday humans use words in sentences and how these words connect to make desired meanings. For example, in my prompt, “sleep paralysis demon with shining emerald third eye dystopian dark style”, the model understood that the shining emerald third eye was part of Odogwu’s face and that the style of the image was dark and dystopian. I tried the exact prompt in other models, and my intended meaning was not immediately apparent.
Understanding how advanced AI systems operate is critical to creating AI that benefits humanity. DALL·E 2, my favourite platform until I was asked to purchase credits, was created by OpenAI. OpenAI began as a non-profit aiming to build “artificial general intelligence (AGI), which is as smart as a human”, with $1 billion in funding from Sam Altman, Elon Musk, Peter Thiel and other donors. It later transitioned to a for-profit company and took a $1 billion investment from Microsoft (bye-bye good tidings) to license and commercialise its technologies (hello capitalism!). With DALL·E 2, I felt that the AI models understood the relationships between words. The resulting images are not perfect yet, but I tested the model with prompts like “a puppy frying Akara”. I wanted to know if the model knew what Akara was, how it gets fried, and how a puppy could be the actor in such a scenario.
The fourth image comes closest to my intentions since the puppy actively puts the dough into the frying pan. The resulting images for the prompt “a puppy frying eggs” are better composed. However, the puppy is not actively participating in the scenario.
OpenAI’s work began with language, and they have a text generator called GPT-3 that can produce news articles or complete short stories in English. You can generate or manipulate text, and developers can use the Codex model series, a descendant of GPT-3, to turn comments into code, complete lines or functions in context, call an API, find libraries, add comments or rewrite code for efficiency. The model is most capable in Python but proficient in over a dozen languages, including JavaScript, Go, Perl, PHP, Ruby, Swift, SQL, and even Shell.
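To make that concrete, here is a minimal sketch of what calling Codex from Python might look like, assuming the `openai` client library and the `code-davinci-002` Codex model OpenAI offered at the time; the prompt and parameters are my own illustrative choices, not from OpenAI’s documentation:

```python
# Minimal sketch: asking Codex to turn a docstring into code via the
# OpenAI Python client (assumes the openai package and an API key set
# in the OPENAI_API_KEY environment variable).
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Completion.create(
    model="code-davinci-002",  # Codex model, a descendant of GPT-3
    prompt='"""Return the n-th Fibonacci number."""\ndef fib(n):',
    max_tokens=64,
    temperature=0,             # deterministic output for code completion
)
print(response.choices[0].text)  # the generated function body
```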
DALL·E 2 understands the relationship between images and the text used to describe them, and it generates images through a process called “Diffusion”. Diffusion begins with a pattern of random dots and gradually alters that pattern towards an image as it recognises specific aspects of that image, allowing it to fuse seemingly unconnected parts of images to create something coherent.
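The loop at the heart of diffusion is simple to sketch, even though the real work happens inside a trained neural network. Below is a toy numpy illustration of the structure only; `predict_noise` is a hypothetical stand-in for the learned, text-conditioned denoiser, not anything from DALL·E:

```python
# Toy sketch of reverse diffusion: start from random dots and repeatedly
# remove a little predicted "noise" until a coherent image remains.
# A real model replaces predict_noise with a trained neural network
# conditioned on the text prompt; this stand-in just pulls pixels
# towards mid-grey so the loop runs end to end.
import numpy as np

def predict_noise(image: np.ndarray, step: int) -> np.ndarray:
    # Hypothetical stand-in for the learned denoiser.
    return image - 0.5

image = np.random.rand(64, 64, 3)      # the pattern of random dots
for step in reversed(range(50)):       # gradually alter the pattern
    noise = predict_noise(image, step)
    image = np.clip(image - 0.1 * noise, 0.0, 1.0)
```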
Other tech giants like Meta and Google have also announced text-to-image and text-to-video AI platforms. Google’s text-to-image technology is similar to DALL·E. Their Imagen and Parti models pursue different but complementary strategies. Imagen is a diffusion model that learns to convert a pattern of random dots to images; like other diffusion models’ outputs, these images start at low resolution and then progressively increase in resolution. Parti’s approach first converts a collection of images into a sequence of code entries, similar to puzzle pieces; a given text prompt is then translated into these code entries, creating a new image. This approach takes advantage of existing research and infrastructure for large language models such as PaLM (Pathways Language Model), and it is critical for handling long and complex text prompts and producing high-quality images. Imagen and Parti are not yet available for public use because of the harmful stereotypes and biases contained in their training datasets.
In August 2022, Stability AI released Stable Diffusion, a publicly accessible image generator that can produce photorealistic images. Unlike Imagen and DALL·E, Stable Diffusion is an open-source tool: the code underlying the AI and the model it is trained on are publicly available. The database underlying Stable Diffusion is called LAION-Aesthetics, which contains image-text pairs filtered according to an aesthetic score. The images below are the results of Odogwu’s prompt in Stable Diffusion’s demo app. I like these results much less than the results from Midjourney. Apart from issues such as models being unable to count objects reliably, not quite understanding intended style nuances, and struggling with spatial or action-oriented descriptions, there are larger ethical issues at stake.
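Open source here is not an abstraction: anyone with a consumer GPU can run the model locally. A minimal sketch using Hugging Face’s diffusers library follows; the model identifier and settings are my assumptions for illustration, not details from Stability AI:

```python
# Sketch: running Stable Diffusion locally with Hugging Face's diffusers
# library (assumes a CUDA GPU and the CompVis/stable-diffusion-v1-4 weights).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,   # half precision to fit consumer GPUs
).to("cuda")

prompt = "sleep paralysis demon with shining emerald third eye dystopian dark style"
image = pipe(prompt).images[0]   # a 512x512 PIL image by default
image.save("odogwu.png")
```

That accessibility is precisely what widens the ethical questions.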
Who is represented and how they are represented is a glaring gap in the magic of text-to-image AI.
“A woman in a red coat looking up at the sky in the middle of Times Square” is invariably a white-passing woman. All AI models on the market today perpetuate the same stereotypes as popular media. How do they say it in computer science lingo: garbage in, garbage out? When I used the prompt “two people hugging…”, only white-passing people came up. Only when I searched “two BLACK people hugging…” did I receive results with non-white-looking people. DALL·E does not allow photorealistic images of real people’s faces, and other apps like Midjourney have not trained their models with enough images of non-white people. Furthermore, “person” often defaults to man, and prompts with a romantic connotation unsurprisingly return heteronormative images. When I searched for a “Black grandma”, I only received fat Black women.
Simon Willison and Andy Baio created a tool to explore 12 million of the 2.3 billion images used to train the Stable Diffusion image generator, and their results are fascinating. The images were pulled from the web, but it is unclear whose images they are. As a photographer and filmmaker who has uploaded hundreds of captioned images to the internet, an obvious question is whether my work was used to train any AI models. OpenAI has said it trained DALL·E 2 on hundreds of millions of captioned images but hasn’t released the proprietary data. The team behind Stable Diffusion have been more transparent about how their model is trained, but their training datasets are impossible for the average person to download, let alone search, with metadata for billions of images stored in obscure file formats in extensive multipart archives. This matters because Stable Diffusion has exploded in popularity, largely thanks to its free and permissive licensing: it is already incorporated into the new Midjourney beta, NightCafe, and Stability AI’s DreamStudio app, and can be run on one’s personal computer. Stable Diffusion was trained off three massive datasets collected by LAION, a non-profit whose compute time was primarily funded by Stable Diffusion’s owner, Stability AI. All of LAION’s image datasets are built off Common Crawl, a non-profit that scrapes billions of web pages monthly and releases them as massive datasets. LAION collected all HTML image tags with alt-text attributes, classified the resulting 5 billion image pairs based on their language, and then filtered the results into separate datasets by resolution and subjective visual quality. Willison and Baio found that nearly half of the images they indexed, about 47%, were sourced from only 100 domains, with the largest number of images coming from Pinterest.
Over a million images, or 8.5% of the total dataset, are scraped from Pinterest’s pinimg.com content delivery network. User-generated content platforms were also a huge source of image data. WordPress-hosted blogs on wp.com and wordpress.com represented 819k images, or 6.8% of all images. Other photo, art and blogging sites included 232k images from SmugMug, 146k from Blogspot, 121k from Flickr, 67k from DeviantArt, 74k from Wikimedia, 48k from 500px, and 28k from Tumblr. Shopping sites were well represented: the second-biggest domain was Fine Art America, which sells art prints and posters, with 698k images (5.8% of the dataset); 244k images came from Shopify, 189k each from Wix and Squarespace, 90k from Redbubble, and just over 47k from Etsy. A large number came from stock image sites: 123RF was the biggest with 497k, 171k images came from Adobe Stock’s CDN at ftcdn.net, 117k from PhotoShelter, 35k from Dreamstime, 23k from iStockPhoto, 22k from Depositphotos, 22k from Unsplash, 15k from Getty Images, 10k from VectorStock, and 10k from Shutterstock, among many others.
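The collection step LAION describes (harvesting every HTML image tag that carries alt text) is conceptually simple. Here is a rough single-page sketch using requests and BeautifulSoup; LAION’s actual pipeline works over Common Crawl’s archives at a vastly larger scale, so treat this as the idea, not the implementation:

```python
# Sketch of alt-text harvesting: collect (image URL, caption) pairs from
# every <img> tag on a page that has a non-empty alt attribute.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

pairs = [
    (img["src"], img["alt"])
    for img in soup.find_all("img")
    if img.get("src") and img.get("alt")   # keep only captioned images
]
for url, caption in pairs:
    print(caption, "->", url)
```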
Finally, of the top 25 artists in the dataset, only three are still living: Phil Koch, Erin Hanson, and Steve Henderson. The most frequent artist in the dataset was Thomas Kinkade, with 9,268 images. The best (and I’m being sarcastic here!) part of this study is that the top 25 artists are all men. This is quite disgusting but not surprising. There are profound open questions about the ethics of laundering human creativity:
- Is it ethical to train an AI on creative work without permission, attribution or compensation?
- Is it ethical to allow people to generate new work in the styles of photographers, illustrators, and artists without compensating them?
- Is it ethical to charge money for a service built on the work of others?
There are examples of large corporations exploiting images people have uploaded to the internet. One example is MegaFace, a facial recognition dataset that used images of Flickr users to advance facial recognition technologies around the world, by companies including Alibaba, Amazon, Google, CyberLink, IntelliVision, Mitsubishi, Orion Star Technology, Philips, Samsung, SenseTime, Sogou, Tencent, and Vision Semantics, among others. According to the press release from the University of Washington, “more than 300 research groups [were] working with MegaFace” as of 2016, including multiple law enforcement agencies. That dataset was used to build the facial recognition AI models that now power surveillance tech companies like Clearview AI and is used by law enforcement agencies worldwide, including the U.S. Army. The Chinese government has used it to train their surveillance systems too. As the New York Times reported last year, MegaFace has been downloaded more than 6,000 times by companies and government agencies worldwide, including the U.S. defence contractor Northrop Grumman; In-Q-Tel, the investment arm of the Central Intelligence Agency; ByteDance, the parent company of the Chinese social media app TikTok; and the Chinese surveillance company Megvii. The University of Washington eventually decommissioned the dataset, but it was too late.
So, for users on these platforms, who owns what? Since I likely contributed to creating these platforms, can I also exploit them for my benefit? Is it mine if I were to publish a graphic novel of Odogwu’s story? Should it be, given the way diffusion models work? I will just have to study the guidelines of the platforms I use to find out. Stability AI requires the distribution of Stable Diffusion and its derivatives to be governed by its CreativeML Open RAIL-M license. The company describes this as a permissive license allowing commercial and non-commercial usage. With respect to me, users are granted a “perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (subject to certain exceptions) patent license to make, use, sell or offer to sell, import or otherwise transfer, the Model and any of its Complementary Material”. I imagine that courts worldwide will grapple with IP rights over creations on these platforms.
Nevertheless, large corporations have found a way to conduct “research” without paying, by channelling funds through academia. It has become standard practice for tech companies working with AI to commercially use datasets and models collected and trained by non-commercial research entities like universities or non-profits. In some cases, they are directly funding that research. For example, many people believe Stability AI created the popular text-to-image AI generator Stable Diffusion, but it funded the model’s development through the Machine Vision & Learning research group at the Ludwig Maximilian University of Munich. The massive image-text caption datasets used to train Stable Diffusion, Google’s Imagen, and the text-to-image component of Meta’s Make-A-Video came from LAION, a small non-profit organisation registered in Germany. Most government regulations authorise data collection for academic studies, yet large tech companies use those academic, non-commercial datasets to train models for future commercial use in their products. Stability AI can commercialise that research in its DreamStudio product, or however else it chooses, and raise a rumoured $100M funding round at a valuation upwards of $1 billion, while shifting any questions around privacy or copyright onto the academic and non-profit entities it funded.
When we venture into what people can create with this technology, we face porn, deepfake propaganda, political censorship and much more. Midjourney has been used to generate images of school shootings, gore, and war photos. As of November 2022, NightCafe has an NSFW Discord channel full of oversized breasts, unrealistically tiny waists, and many more forms of the objectification of women. I can tell these images were created by cis-hetero male users, judging by the number of maids, warriors and elves outfitted in the resulting images. There are more alarming ethical issues facing text-to-image AI use cases. A website called Porn Pen allows users to customise the appearance of nude AI-generated models — all of which are women — using tags like “lingerie model”, “celebrity”, and ethnicities (e.g., “Scandinavian” and “Latina”). Buttons capture “models” from the front, back, or side, and change the appearance of the generated photo. Porn Pen lets users generate images of realistic-looking models in “blow job”, “doggy-style”, or “missionary” positions. Porn Pen functions like “This Person Does Not Exist”, only it’s NSFW.
Definitions of “unsafe” are very broad and could fall into censorship debates. DALL·E minimises this exposure by removing the most explicit content from the training data; you can argue that this is censorship too. OpenAI has also acknowledged and worked to mitigate racial and gender biases in its image training set, and it avoids generating sexual or violent content, recognisable celebrities and trademarked characters. Stable Diffusion’s license, by contrast, is a form of social agreement, as it relies on users self-regulating their conduct; users cannot be penalised if they fail to comply with the license. Stable Diffusion’s software does ship with an adjustable AI tool, the “Safety Classifier”, to remove undesirable outputs. However, Stable Diffusion acknowledges that “[t]he parameters of this tool can be readily adjusted”.
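“Readily adjusted” is no exaggeration. In the widely used diffusers implementation of Stable Diffusion, for instance, the safety checker is just an attribute on the pipeline object, and a sketch of disabling it takes one line (an illustration of the general point, not a recommendation):

```python
# Sketch: the NSFW filter in diffusers' Stable Diffusion pipeline is a
# plain attribute and can be switched off by reassigning it.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe.safety_checker = None   # the "Safety Classifier" is now gone
```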
Images that are explicitly pornographic can also be of artistic value. There is a larger argument about censorship at stake; however, there is a clear line from the objectification of women to violence against women. When it comes to censorship, more prominent players, such as governments, can censor user-generated content to suit their needs. Some governments are already removing the ability to create “politically sensitive” content from platforms. For example, Tiananmen Square has been removed from a Chinese-backed image generation app: China’s leading text-to-image synthesis model, Baidu’s ERNIE-ViLG, censors political text such as “Tiananmen”, and Baidu also censors the names of political leaders, as Zeyi Yang reported for MIT Technology Review.
What started as a short piece on a cute story I created with the help of text-to-image AI has turned into this long read on what makes text-to-image AI so cool, how it works, and where we should be concerned, as it affects even the regular person on individual and institutional levels. The bottom line is that all these platforms are still trying to figure out safety, but they have conveniently glossed over ethics and intellectual property rights. The technology is spreading faster than anyone can regulate. For users, it is incredible to have interfaces that can help us express ourselves. These interfaces will get better at understanding our prompts and generate more nuanced results. The barrier to becoming a trained illustrator, artist or painter is high; this is part of what makes text-to-image AI so cool. One of the biggest non-technical arguments about text-to-image AI is that it will put artists like graphic designers and illustrators out of work. That is not true. It is a reductionist way of thinking. The technology cannot replace trained expertise. Most of my designer friends have been using these platforms to optimise their workflows. On DALL·E 2, user feedback inspired OpenAI to build features like Outpainting, which lets users continue an image beyond its original borders and create bigger images of any size. New markets are springing up with secondary products such as PromptBase, where you can buy and sell prompts for machine learning systems.
Of course, the prominent tech actors will be waiting to swoop in and exploit the user-generated content. Go creator economy!!!! Tech actors have already started rolling out AI-generated media in audio and video formats. I plan to complete Odogwu’s story and turn it into a wild experimental film; I got sidetracked by delving into AI and writing this article. While generating images is fun, the structural issues, such as copyright, the reproduction of violence, censorship and big tech ripping us all off, are much more significant and scary. Having said all that, I won’t stop pursuing my story and creating with these platforms. I’m here for where they choose to go next.
Sources
- DALL·E: Creating Images from Text
- If AI is Killing Photography, Does That Mean Photography Killed Painting?
- Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator
- AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability
- Dalle-2 Preview: Risks and Limitations
- Anyone can use this AI art generator — that’s the risk
- AI can now create any image in seconds, bringing wonder and danger
- Startup Behind AI Image Generator Stable Diffusion Is In Talks To Raise At A Valuation Up To $1 Billion
- China’s most advanced AI image generator already blocks political content
- How AI creates photorealistic images from text
- Stable Diffusion – AI Art for the Masses
- Stable Diffusion model card