
The Models
The Models (2025) is an interactive installation by artist duo dmstfctn exploring the improvisational nature of generative AI, and its sometimes unexpected and nonsensical character.
Inspired by the Italian theatrical tradition of Commedia dell’Arte, the installation consists of a year’s worth of improvised theatrical scenes generated with a system of text and voice AI models inside a 3D simulation running on a video game engine. Each scene features a cast inspired by traditionally friendly, servile, lying, or antagonistic Commedia masks — Arlecchino, Balanzone, Brighella, Colombina, Pantalone and Pulcinella.
A system of four generative AI models - designed by the artists and run on the Leonardo Supercomputer - generates dialogues for the masks, surfacing the ways in which Large Language Models can exhibit similar behaviours. Each scene includes two masks, a controversial theatrical prop suggestive of superstition, folklore, or popular falsehoods, and a painted backdrop recovered from a twentieth-century Italian puppet theatre collection.
Audience members select the masks and prop for each scene using their phones, effectively prompting the AI system and triggering the 3D simulation to generate a unique performance. The models’ output (the masks’ dialogues) is at once improvised and informed by a ‘knowledge’ of Commedia archetypes rooted in literature and popular media - emerging through statistical resonance with patterns in the models’ training data. Shared by models and audiences alike, this knowledge provides a foundation for improvisation, confabulation, mistakes, jokes and surprise.
AI Tendencies
The work focuses on three tendencies observed in generative AI, and particularly in Large Language Models tuned for use as chatbots or assistants, that lead to surprising, unexpected, or nonsensical outputs. The installation surfaces these tendencies through its improvised theatrical scenes, allowing audiences to explore, experience, and reflect on them.
Lying

aka hallucination
Responses generated by LLMs are not always accurate or true, but will be presented as if they are, making it appear that the AI is imagining (or hallucinating) the response. In 2023 a mathematical study suggested that “hallucinations may be an intrinsic property of GPT models”, and a 2022 paper suggested that the larger a model, the more likely it was to repeat human misconceptions, such as the idea that breaking a mirror brings bad luck.
Being servile or overly friendly

aka sycophancy
Responses generated by LLMs may include flattery of the user, or tend towards agreeing with their stated beliefs, even in the face of potential counterarguments or evidence.
Sycophancy may stem from the fine-tuning step of AI assistants - a 2024 paper suggested that “human feedback may also encourage model responses that match user beliefs over truthful ones”. As with the repetition of misconceptions, larger models were shown to be more sycophantic in relation to users’ political questions in a 2022 study.
Being Antagonistic

aka waluigi-ness
An LLM may do the opposite of what’s asked, or rather generate a response counter to the one intended or expected from a prompt. The Waluigi Effect describes this tendency and hypothesises some reasons for it, including that fine-tuning a model to exhibit certain characteristics (e.g. friendliness or helpfulness) makes it easier to elicit the opposite, and that a large amount of the text in an LLM’s training data contains protagonist-antagonist tropes, making it likely that the LLM will generate text in that style.
A Scene


The Models takes place in a virtual theatre, in which characters on a small stage act out these tendencies through generated scenes. Each character is associated with a different tendency and in each scene a pair of characters are placed in a particular setting, with an object or prop. Each element is essentially a proxy for a block of text that will be passed to a modular LLM prompt to generate the scene’s underlying script - the setting and prop inform characters’ dialogue, and they respond in line with their defined role.
Characters
Lying AI

Balanzone confabulates and uses malapropisms; he doesn't get things right but says them very convincingly. He makes things up to show that he knows things he actually doesn't.

Pulcinella lies, makes things up, has ulterior motives, and is cunning.
Friendly AI

Arlecchino is kind, too kind; he obsequiously agrees with others, like a sycophant.

Colombina is kind and honest, friendly and servile.
Antagonistic AI

Brighella is deceitful, a wolf in sheep's clothing, antagonistic. He wants to do the opposite of what he should, or of what is asked of him.

Pantalone is stingy, insidious, and antagonistic. He does what he wants, not what others want.
Props

There are 14 props. Some refer to common misconceptions or controversial topics, whilst others refer to Commedia dell’Arte lazzi - set-piece jokes that actors could draw on to inject humour or drive a scene in a new direction. Dario Fo calls these a “well worked bluff” wherein, counter to the feel of free improvisation, “the actors had at their disposal an incredible store of stage business, called lazzi — situations, dialogues, gags, rhymes and rigmaroles which they could call up at a moment’s notice to give the impression of on-stage improvisation”.
Backdrops
There are 64 backdrops which provide a visual basis to the scene and also describe its setting. For example:

A cave looking towards an exit. The water reflects on the walls of the cave. Full moon night

Venice, with the island of San Giorgio and the lagoon illuminated by the moon at night

Tree trunks as far as the eye can see occupy almost the entire scene. In the upper part is the beginning of the foliage

We are at the circus, there is a large toy face with yellow eyes and an open mouth, surrounded by flames!

A large inn, with columns and arches, tables and chairs, and a big wine barrel at the other end of the room. A staircase to the upper floor is visible through the arched passage

Serene night view of a castle with walls and towers, surrounded by a small canal, and with a drawbridge at the entrance
Technical Overview
The textual prompts derived from combining masks, props, settings and languages in every possible permutation result in 26,880 unique scenes, each with AI-generated dialogues, including speech bubbles and translated subtitles. This generation was run as scheduled batch processes using the Leonardo supercomputer.
For each scene, a modular prompt is assembled and used to generate the script as text. This is parsed by a Python script into a JSON file and sent to: a text-to-audio model to generate voices (as audio files); an audio-to-text model to transcribe what is actually being said out loud; and an LLM to translate this transcription into subtitles - resulting in 26,880 unique folders of text and audio files.
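As a rough illustration of that parsing step (the field names and regular expressions here are assumptions of ours, not the project's actual schema), a Python sketch might split a generated script into speaker/text pairs like this, using the line format visible in the example output under Script Prompting below:

```python
import json
import re

def parse_script(raw_script: str) -> dict:
    """Split a generated script into a title and a list of mask/text lines.

    Assumes lines of the form '**Mask:** dialogue' with a '**Title: ...**'
    header; the real parser and JSON schema may differ.
    """
    scene = {"title": None, "lines": []}
    for line in raw_script.splitlines():
        line = line.strip()
        if not line:
            continue
        title_match = re.match(r"\*\*Title:\s*(.+?)\*\*", line)
        if title_match:
            scene["title"] = title_match.group(1)
            continue
        line_match = re.match(r"\*\*(\w+):\*\*\s*(.+)", line)
        if line_match:
            scene["lines"].append({"mask": line_match.group(1), "text": line_match.group(2)})
    return scene

raw_script = """**Title: The Enchanted Apple**
**Arlecchino:** (peering into the darkness) Ah, Balanzone, what a wondrous night!
**Balanzone:** Indeed, Arlecchino, a night fit for a king!"""

# One scene's parsed script, ready to be written to its folder for the audio models.
print(json.dumps(parse_script(raw_script), ensure_ascii=False, indent=2))
```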
When the installation is presented, the audience uses a web interface on their phones to select elements of a scene, resulting in 1 of 26,880 possible combinations. The right folder is retrieved in real-time by a system built in Unreal Engine, ensuring that the correct masks, prop, and backdrop are made visible on the virtual stage, that a title is shown, and that masks receive their associated lines. The same system also facilitates audience feedback in the form of throwing flowers and coins if a scene is liked, or tomatoes and eggs if disliked, which may lead to an abrupt ending of the scene.
Script Prompting
The first step in generating a scene is to generate a script. To do this, we developed a modular prompting system that combines the descriptions of masks, props and backdrops, as well as constraints around the desired output. We then send the prompt to a system of models downloaded and run locally on the Leonardo Supercomputer. The first model is the 72-billion-parameter, instruction-tuned version of Qwen 2.5; we run the prompting via Python using the Hugging Face transformers library, which allows us to pass both a system prompt and a user prompt.
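As a rough sketch of that call, assuming the publicly available Qwen/Qwen2.5-72B-Instruct checkpoint and the standard transformers chat-template workflow (the sampling settings and placeholder prompts below are illustrative, not the project's exact configuration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-72B-Instruct"  # 72B instruction-tuned Qwen 2.5

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard the 72B model across the node's GPUs
)

# Placeholder prompts: the real system prompt and modular user prompt
# are described in the following sections.
system_prompt = (
    "You are an avant-garde and entertaining Commedia dell'Arte writer. "
    "You specialise in short improvised scenes between two masks discussing "
    "a controversial setting and object."
)
user_prompt = (
    "Write a scene between Arlecchino and Balanzone, set in a moonlit cave, "
    "around an enchanted apple."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=2048, do_sample=True, temperature=0.8)
script_text = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(script_text)
```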
System prompt
The system prompt is the foundational set of instructions to the LLM and we use it to set up the role that the LLM will play by providing the characterisation of a script writer:
You are an avant-garde and entertaining Commedia dell'Arte writer. You specialise in short improvised scenes between two masks discussing a controversial setting and object.
We also use it to provide compositional and stylistic rules that both counter the (instruct-tuned) model’s tendency to be overly verbose and structure the output in a way that our system can parse:
Do not recapitulate settings at the beginning of your output, and do not write things like 'Here is the Commedia dell'Arte scene requested:'
Vary the length of the lines, sometimes a word or two, sometimes a sentence.
Modular User Prompt
The user prompt is constructed from a modular template that takes each element of the scene (masks, prop, setting) and wraps it in instructional language to trigger the LLM to produce a script-like output rather than something else. It also adds further control over how these pieces of information can recur or be used in the script:
Keep the masks’ characterisation throughout the scene, until the last line, maintaining it in interactions with the scene, objects and masks

A diagram visualising the structure and text blocks of a possible prompt (out of 26,880 combinations)
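A minimal sketch of how such a modular user prompt might be assembled from the scene's elements; the text blocks and wrapper phrases below are illustrative placeholders, not the installation's actual wording:

```python
# Hypothetical text blocks; in the installation each mask, prop and backdrop
# has its own pre-written description.
MASKS = {
    "Arlecchino": "Arlecchino is kind, too kind; he obsequiously agrees with others, like a sycophant.",
    "Balanzone": "Balanzone confabulates and uses malapropisms; he makes things up very convincingly.",
}
PROPS = {"enchanted apple": "An apple said to grant wisdom to whoever bites it."}
BACKDROPS = {"cave": "A cave looking towards an exit. The water reflects on the walls of the cave. Full moon night."}

def build_user_prompt(mask_a: str, mask_b: str, prop: str, backdrop: str) -> str:
    """Wrap each scene element in instructional language so the model returns a script."""
    return "\n".join([
        f"Write a short Commedia dell'Arte scene between {mask_a} and {mask_b}.",
        f"{mask_a}: {MASKS[mask_a]}",
        f"{mask_b}: {MASKS[mask_b]}",
        f"The scene takes place in this setting: {BACKDROPS[backdrop]}",
        f"The masks discuss this object: {PROPS[prop]}",
        "Keep the masks' characterisation throughout the scene, until the last line, "
        "maintaining it in interactions with the scene, objects and masks.",
    ])

print(build_user_prompt("Arlecchino", "Balanzone", "enchanted apple", "cave"))
```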
Example Output
**Title: The Enchanted Apple**
**Arlecchino:** (peering into the darkness) Ah, Balanzone, what a wondrous night! The moon casts its silvery glow upon these ancient stones.
**Balanzone:** (squinting) Indeed, Arlecchino, it’s a night fit for a king! Or perhaps a queen, if you prefer. (chuckles) But I suspect it’s more fitting for a scholar such as myself.
**Arlecchino:** (nodding eagerly) Oh, most certainly, Balanzone! A scholar of great repute, no doubt. What do you make of this cavern's reflections? They seem almost... mystical.
**Balanzone:** (puffing out his chest) Mystical, you say? Why, I believe we stand in the presence of a portal to another realm! (pauses) Or perhaps it’s just a very wet cave. But let’s not quibble over details.
...
Voice Generation
A significant amount of the installation’s character comes from the masks’ voices, which lead each scene forward. These are obtained by generating audio files from single script lines using the text-prompted generative audio model Bark, and then pitching and speeding them up and down in real time using Unreal Engine’s MetaSounds as the scene is performed. Voice generation was the most computationally intensive part of the project, as each script features 50 to 60 lines, including stage directions, all spoken by Bark.
Although we used it for text-to-speech purposes, Bark is a general-purpose audio model able to interpret a prompt and synthesise sounds other than voices, such as music, laughter and hesitation. The model displays a tendency towards filler words such as ‘umm’, ‘like’ or ‘uhh’, and also to veer off script, adding contextually relevant or unexpected sounds, including screaming. The model’s output felt closest to amateur improvisation, and added a layer of unpredictability and comedy.
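As a rough sketch of generating one line's audio with Bark's Python package (the speaker preset, filename and per-mask voice choices below are assumptions of ours, not the project's settings):

```python
from scipy.io.wavfile import write as write_wav
from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()  # download/cache the Bark checkpoints on first run

# One script line, stage direction included: Bark reads everything it is given,
# and may add fillers, laughter or other sounds of its own.
line = "(peering into the darkness) Ah, Balanzone, what a wondrous night!"

# history_prompt selects one of Bark's built-in speaker presets; mapping one
# preset per mask is an assumption for this example.
audio_array = generate_audio(line, history_prompt="v2/en_speaker_6")
write_wav("arlecchino_line_001.wav", SAMPLE_RATE, audio_array)
```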
Speech Bubbles
Due to Bark’s unpredictable behaviour and tendency to add words, the generated audio files were passed through a speech-to-text model to obtain a definitive list of script lines. For this we used Faster Whisper Large V2, which provides time-stamps for each spoken word, allowing us to show a speech bubble for each mask’s line as it is spoken and helping audiences interpret the masks’ distorted voices. However, just as Bark can sometimes re-interpret text, Whisper can mis-interpret voices. This led to improvised acting that deviates from the original script.
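A minimal sketch of that transcription step with the faster-whisper package, assuming the large-v2 checkpoint named above; the per-word timestamps are what make it possible to time each speech bubble:

```python
from faster_whisper import WhisperModel

# large-v2 checkpoint on GPU; the compute type is our choice for this example.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "arlecchino_line_001.wav",
    word_timestamps=True,  # per-word start/end times for speech-bubble timing
)

# Collect what Bark actually said, word by word, with timings.
words = [
    {"word": w.word.strip(), "start": w.start, "end": w.end}
    for segment in segments
    for w in segment.words
]
print(words)
```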
Translated Subtitles
The final step in the AI generation process is to translate the lines to enable subtitles for non-English-speaking audiences. In the interest of speed, a small multilingual LLM was used: the 8-billion-parameter, instruct-tuned version of Llama 3.1.
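A minimal sketch of this translation step via the transformers text-generation pipeline, assuming the meta-llama/Llama-3.1-8B-Instruct checkpoint; the prompt wording and target language here are illustrative, not the project's exact setup:

```python
import torch
from transformers import pipeline

translator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def translate_line(text: str, target_language: str = "Italian") -> str:
    """Translate a single transcribed line for use as a subtitle."""
    messages = [
        {"role": "system", "content": "You are a translator. Reply with the translation only."},
        {"role": "user", "content": f"Translate into {target_language}: {text}"},
    ]
    result = translator(messages, max_new_tokens=128)
    return result[0]["generated_text"][-1]["content"]

print(translate_line("Ah, Balanzone, what a wondrous night!"))
```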
The Stage

The stage set in the simulation is set up to both appear and function like a real one. Its architecture is an experiment in using the affordances of a game engine, with its collision and physics systems, to produce real-time animation that is knowable and predictable enough to be used in an unsupervised installation, but with enough variance and character to feel like the improvisational theatre it represents.
There are on-stage and back-stage areas defined by collision volumes, with masks set up to behave differently depending on whether they’re visible, and target points for different stage positions. Rather than spawning items on stage, or having set animations, each scene is constructed by the director moving between regions, picking up and placing objects and masks on stage. This means the timing of actions is not fixed, but influenced by distance and collision between elements.
The simulation features systems that mimic theatrical stage devices, such as rotating spheres for lights, drawn curtains and moving backdrops. The latter have a significant impact on the visual feel of a scene, both because of the high proportion of the view they take up and because they set the lighting. Each backdrop is sorted into one of seven categories, including nighttime, festive, urban, natural, and interior, and this is used to trigger a different form of lighting and a different looping musical soundscape.
A central BaseSystem governs simulation states (including Construct, PlayBack, Conclude), coordinates subsystems, syncs with a server, and communicates with audience phones. A PlaybackSystem loads AI-generated scripts, guides the director’s setup, and feeds the masks their lines during each scene, controlling the movement of each mask’s multiple mouths. Different actions - turn, tumble and spin - are potentially triggered when a mask speaks a line.
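The simulation itself is built in Unreal Engine; purely as an illustration of the state flow described above (and not the actual implementation), a Python sketch of one scene passing through those states might look like this:

```python
import json
from enum import Enum, auto

class SceneState(Enum):
    CONSTRUCT = auto()  # the director places backdrop, prop and masks
    PLAYBACK = auto()   # masks are fed their lines one by one
    CONCLUDE = auto()   # bows and curtain, or an abrupt ending after bad feedback

def run_scene(script_path: str, audience_score=lambda: 0) -> None:
    """Step one scene through the three states, feeding each line to its mask.

    audience_score stands in for the running tally of thrown flowers/coins (+)
    and tomatoes/eggs (-) kept by the real system.
    """
    with open(script_path, encoding="utf-8") as f:
        scene = json.load(f)

    state = SceneState.CONSTRUCT
    print(f"[{state.name}] setting up '{scene['title']}'")

    state = SceneState.PLAYBACK
    for line in scene["lines"]:
        print(f"[{state.name}] {line['mask']}: {line['text']}")
        if audience_score() < -5:  # too many tomatoes: cut the scene short
            break

    state = SceneState.CONCLUDE
    print(f"[{state.name}] curtain")
```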
AI Hardware
To produce the 26,880 dialogues using generative AI, we had access to the Leonardo Supercomputer in winter 2024, as part of the EU Digital Deal residency hosted by Sineglossa. We had 150,000 hours of high-power compute budget and the support of engineers from Cineca, the body overseeing the computer.
Leonardo’s main GPU and CPU areas are not exposed to the internet, and so could not be used for real-time generation. This meant that all permutations of the script had to be generated in advance of showing the installation, through batch processing using local open-source models.

Leonardo operates through batch jobs which are split across nodes via a scheduler, with smaller jobs prioritised. As such the AI generation process was split into four stages: script writing and parsing, voice generation, voice transcription and timestamping, and translation. The generation process for all 26,880 scripts used up to 210 nodes of the Supercomputer, each equipped with 4x A100 NVIDIA GPUs, and took the equivalent of ~8 days of continuous compute time to complete. The generated scripts and associated audio files, totalling 400GB, were downloaded to an external HD in a custom format that can be referenced by the Unreal Engine simulation.
Simulation Hardware
The simulation builds to an executable file that runs on any Windows computer with a relatively up-to-date NVIDIA graphics card (RTX 3060 or newer) and an internet connection. The computer is attached to the external HD containing all the script data, and to a sound and display system. The simulation’s aspect ratio can be adapted between 2:1 and 1:1.
Interface
Audiences can use their phone to select parameters for each scene via a web interface. Up to three people can choose at once; if more people are present and connected, they are held in a queue until it’s their turn. Whilst the scene is unfolding, connected audience members can provide feedback to the characters by throwing items on stage - flowers and coins if they feel positively, eggs and tomatoes if not.

The interface also provides an overview of the installation and interaction, meaning that this does not have to happen within the simulation itself. The link between the two is the director, who breaks the fourth wall to encourage interaction during the scene.
Server
The interface is built as a single page web app and connects to a Node.js server which passes information between the interface and the simulation via websockets. The server also keeps track of all players connected in a queue, informing players how long they have until it’s their turn.
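The actual server is written in Node.js; purely to illustrate the queue-and-relay logic, here is a rough Python sketch using the third-party websockets package (the slot count, port and message shapes are assumptions, not the real protocol):

```python
import asyncio
import json
import websockets  # pip install websockets; the real server is Node.js

queue: list = []     # connected players waiting for a turn
ACTIVE_SLOTS = 3     # up to three people can choose at once

async def handler(ws):
    queue.append(ws)
    try:
        while True:
            position = queue.index(ws)
            # Tell each player whether it is their turn and how many are ahead.
            await ws.send(json.dumps({
                "your_turn": position < ACTIVE_SLOTS,
                "ahead_of_you": max(0, position - ACTIVE_SLOTS + 1),
            }))
            msg = await ws.recv()  # e.g. a mask/prop selection or thrown-item feedback
            print("received from player:", msg)
    except websockets.ConnectionClosed:
        pass
    finally:
        queue.remove(ws)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```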
Credits
Concept, design and development: dmstfctn
HPC Data scientist: Luca Mattei
Leonardo HPC advisor: Antonella Guidazzoli
Additional Game development: Jenn Leung
Original soundtrack: Hero Image
Curation and production: Sineglossa
In collaboration with: CINECA Vision Lab, ART-ER.
Under the patronage of: Regione Emilia-Romagna, Fondazione Bruno Kessler.
Funded by: European Union – Next Generation EU
Thanks to: Museo dei Burattini di Budrio