Artificial Intelligence in visual creation: crafting effective prompts

Artificial Intelligence (AI) is a field of computer science focused on designing hardware and software systems capable of mimicking human behavior. The goal is to equip machines with typically human traits, such as visual perception, spatiotemporal perception, and decision-making. AI can be assessed by analogy with human action and thought: an intelligent system acts rationally to achieve the best possible outcome, given its context and the information available.
The term "artificial intelligence" was coined by John McCarthy in 1956. Initially, AI research adopted a symbolic approach with expert systems, but in the 1960s and 1970s progress slowed because of limited computational power. In the 1980s, AI saw a resurgence with a focus on Machine Learning, driven by artificial neural networks. In the twenty-first century, AI experienced exponential growth thanks to abundant data and the widespread use of Deep Learning. Artificial (or simulated) neural networks are the computational model behind deep learning, inspired by the functioning of biological neurons in the human brain.

Today, AI is an integral part of everyday life, with virtual assistants like Siri and Alexa. It is employed in sectors such as industrial automation, autonomous driving, medicine, and financial trading. Despite these successes, challenges of ethics, transparency, and interpretability remain to be addressed.
Generative techniques in AI learn from existing data and are designed to create new synthetic data that resembles the real data. They can generate previously unseen data or model the complex probability distribution of the data. Generative techniques aim to maximize likelihood, i.e., the probability of generating data that follows a distribution similar to that of the training data.
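The likelihood objective mentioned above is commonly written in the following form. The notation is an assumption introduced here for illustration and does not come from any specific model: x_1, ..., x_N are the training examples, p_θ is the distribution defined by the model, and θ are its parameters. In practice the logarithm of the likelihood is maximized, which turns the product over examples into a sum.

```latex
\theta^{*} \;=\; \arg\max_{\theta} \prod_{i=1}^{N} p_{\theta}(x_i)
           \;=\; \arg\max_{\theta} \sum_{i=1}^{N} \log p_{\theta}(x_i)
```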

On January 5, 2021, OpenAI, a company specializing in artificial intelligence research and implementation, announced on its blog the ability to manipulate visual concepts through language: "We have trained a neural network called DALL·E, which creates images from text captions for a wide range of concepts expressible in natural language".
The name "DALL·E" is a play on words, referencing the surrealist artist Salvador Dalí and the fictional robot WALL-E.
Significant advancements in Deep Learning have enabled the integration of Artificial Neural Networks into natural language processing. DALL·E, based on the GPT (Generative Pre-trained Transformer) architecture, has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts coherently, rendering text, and applying transformations to existing images.

Like GPT-3, DALL·E is a transformer language model. It receives both the text and the image as a single stream of data containing up to 1280 tokens. A token is any symbol from a discrete vocabulary; for humans, for example, each English letter is a token from an alphabet of 26 letters. DALL·E's vocabulary contains tokens for both text and image concepts.
DALL·E can create images for a wide variety of phrases. The freely accessible prompt book on the dallery.gallery website, which opens with tongue-in-cheek warnings that the generated images are not real, explains how to structure a natural-language request so that the result matches our needs as closely as possible. This manual of sorts offers valuable advice for making a request more precise: do you want a realistic image? A 3D drawing? An illustration?
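Returning to the single stream of tokens described above, the idea can be made concrete with a toy sketch. This is not DALL·E's actual tokenizer: the letter-based vocabulary follows the 26-letter analogy in the text, while the function names, the offset scheme for image tokens, and the sample image codes are assumptions made purely for illustration.

```python
import string

# Toy text vocabulary: the 26 lowercase English letters, one token ID each.
TEXT_VOCAB = {letter: i for i, letter in enumerate(string.ascii_lowercase)}
IMAGE_OFFSET = len(TEXT_VOCAB)  # assumption: image token IDs start after the text IDs
MAX_STREAM_TOKENS = 1280        # stream limit quoted in the text


def encode_text(text):
    """Map each letter to its token ID, skipping anything outside the alphabet."""
    return [TEXT_VOCAB[ch] for ch in text.lower() if ch in TEXT_VOCAB]


def build_stream(text, image_codes):
    """Concatenate text tokens and offset image tokens into one stream."""
    stream = encode_text(text) + [IMAGE_OFFSET + code for code in image_codes]
    if len(stream) > MAX_STREAM_TOKENS:
        raise ValueError(
            f"stream has {len(stream)} tokens, limit is {MAX_STREAM_TOKENS}"
        )
    return stream


# Hypothetical image codes; a real model would obtain them from an image encoder.
stream = build_stream("a cat", image_codes=[0, 5, 2])
print(stream)  # [0, 2, 0, 19, 26, 31, 28]
```

Because text and image tokens share one vocabulary and one sequence, a single model can treat "caption followed by image" as one piece of data to be continued.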

In examining DALL·E, it is essential to keep in mind that the model was not "taught" in the traditional sense; instead, it acquired its knowledge by studying a vast dataset of 650 million images. Consequently, there is no definitive manual for its use, and the tool's capabilities have to be explored through hands-on practice.
To start generating an image, you enter a command, or prompt, of up to 400 characters describing the image you want. Within that limit there is no prescribed length, and a prompt can even consist only of emojis. Clearly, the more specific the prompt, the more closely the image will reflect what you had in mind.

To write effective prompts, it's advisable to use industry-specific jargon. For instance, if you want a photograph, it's better to use terminology from the world of photography, such as close-up or fish-eye. Before generating a photograph, it helps to ask yourself exactly what you want to create, so that the prompt can include the following information:

Describing a shooting context
Lighting: noon, sunset, dawn, flat lighting, shadows and silhouettes, illuminated primary subject, studio lighting. You can also define a light source, such as "a lighthouse illuminating the beach" or "colors illuminating a face".
Years and context:
cave paintings, ancient Egypt, decorative murals, Roman mosaics, Renaissance, Mannerism, Baroque, Neoclassicism, Realism, Art Nouveau, Impressionism, Art Deco, Dada, Abstract, Bauhaus, Cubism, Expressionism, Fauvism, Futurism, Orphism, Street Art.
Lens and shutter settings: fast shutter speed, slow shutter speed, motion blur, fisheye lens, micro/macro lens, wide-angle lens.
Frame composition:
- Proximity: close-up, medium shot, long shot, extreme long shot.
- Angle: overhead view, low angle, aerial view, tilted frame.
Film type:
- Kodachrome: vibrant green and red
- Autochrome: muted yellow and green
- Lomography: maximum saturation images
- Polaroid
- Camera phone: reminiscent of early digital images
- CCTV: surveillance photos, in black and white
- Disposable camera: amateurish photography
- Daguerreotype: vintage style
- Camera obscura: pinhole photography
- Double exposure: combining two elements
- Cyanotype: blue and white photos
- Black and White: classic monochrome
- Redscale photography: red color dominant
- Infrared photography: renders plants pink
- Instagram Hipstamatic: faux retro style
- Contact sheet: combines multiple images
- Colour splash: highlights only one color
- Solarized: some colors are negative
- Anaglyph: 3D format
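
The categories above can be combined into a single prompt. The following is a minimal sketch of one way to do it, not a prescribed recipe: the `PhotoPrompt` class, its field names, and the ordering of the parts are hypothetical choices, and counting characters with `len` is an assumption about how the 400-character limit is measured.

```python
from dataclasses import dataclass

MAX_PROMPT_CHARS = 400  # limit quoted earlier in the text


@dataclass
class PhotoPrompt:
    """Hypothetical container for the photographic details listed above."""
    subject: str
    framing: str = ""    # e.g. "close-up", "long shot"
    angle: str = ""      # e.g. "low angle", "aerial view"
    lens: str = ""       # e.g. "fisheye lens", "wide-angle lens"
    lighting: str = ""   # e.g. "sunset", "studio lighting"
    era: str = ""        # e.g. "Art Deco", "Renaissance"
    film_type: str = ""  # e.g. "Kodachrome", "Polaroid"

    def render(self) -> str:
        """Join the non-empty parts with commas and enforce the length limit."""
        parts = [self.subject, self.framing, self.angle, self.lens,
                 self.lighting, self.era, self.film_type]
        prompt = ", ".join(p.strip() for p in parts if p.strip())
        if len(prompt) > MAX_PROMPT_CHARS:
            raise ValueError(
                f"prompt has {len(prompt)} characters, limit is {MAX_PROMPT_CHARS}"
            )
        return prompt


prompt = PhotoPrompt(
    subject="a lighthouse illuminating the beach",
    framing="long shot",
    angle="low angle",
    lens="wide-angle lens",
    lighting="sunset",
    film_type="Kodachrome",
).render()
print(prompt)
# a lighthouse illuminating the beach, long shot, low angle, wide-angle lens, sunset, Kodachrome
```

Keeping each category in its own field makes it easy to change one element at a time, for example swapping "Kodachrome" for "Cyanotype", and to compare the resulting images.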

To generate an illustration, we can select the style, colors, and medium. For a 3D image of a building, it's preferable to use architectural terminology.

Adjectives, colors (black and white, sepia), and dates can all influence the style. One technique for generating images in a specific artistic or design style is "unbundling", that is, decomposing the style into its characteristic traits. For instance, if you want an image of a dining room in a cubist style, you can enter "dining room in blue color in Picasso's cubist style." Alternatively, to achieve better results, ask ChatGPT to describe Picasso's cubist style and incorporate that description into the prompt.
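
The difference between naming a style and unbundling it can be sketched as follows. The `unbundle` helper is hypothetical, and the cubist traits are illustrative placeholders standing in for whatever description you obtain, for example from ChatGPT; they are not a canonical definition of the style.

```python
def unbundle(subject, style_traits, color=""):
    """Build a prompt that spells out a style's traits instead of naming the style."""
    described = f"{subject} in {color} color" if color else subject
    return ", ".join([described, *style_traits])


# Bundled version: the style is only named.
bundled = "dining room in blue color in Picasso's cubist style"

# Unbundled version: the style is replaced by a description of its traits.
# These traits are illustrative assumptions, not an authoritative list.
cubist_traits = [
    "fragmented geometric shapes",
    "multiple viewpoints at once",
    "flattened perspective",
]
unbundled = unbundle("dining room", cubist_traits, color="blue")
print(unbundled)
# dining room in blue color, fragmented geometric shapes, multiple viewpoints at once, flattened perspective
```

Generating images from both versions and comparing them is a simple way to see which traits actually steer the result.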