
Lisa Basil

This page documents all the data we gathered while making this project. You'll be able to see how we obtained our data, along with samples of the resulting XML we created using regular expressions.
Basically, our workflow for data-gathering went like this:
Online transcripts -> Text files -> XML files
Then we could use that structured data to analyze trends in the series, such as how many lines each character has. For more about that, check out the Analysis Page.

To begin with, we needed to decide on a source to glean our data from. Obviously we couldn't play all the games while copying each line! The source we settled on was the Ace Attorney franchise's Fandom page. A Fandom page is an online encyclopedia page about a movie, TV show, or game series, created and moderated entirely by fans. The Fandom site has transcripts of all of the chapters of the series. However, the HTML behind these transcripts was very intimidating to read, as you can see from this brief sample:

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<meta charset="UTF-8"/>
<title>Category:Transcripts | Ace Attorney Wiki | Fandom</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"1bf5989d5f11bb285c04add17f364fbf","wgCSPNonce":false,"wgCanonicalNamespace":"Category","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":14,"wgPageName":"Category:Transcripts","wgTitle":"Transcripts","wgCurRevisionId":141564,"wgRevisionId":141564,"wgArticleId":19444,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Ace Attorney"],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Category:Transcripts","wgRelevantArticleId":19444,"wgIsProbablyEditable":true,"wgRelevantPageIsProbablyEditable":true,"wgRestrictionEdit":[],"wgRestrictionMove":[],"wgNoExternals":false,
"wgFandomQuizzesOnFepoEnabled":false,"wgFandomQuizzesGenAiQuizzesOnArticlesEnabled":false,"wgFandomQuizzesCommunityQuizzesOnArticlesEnabled":false,"wgPageLanguageHasWordBreaks":true,"egFacebookAppId":"112328095453510","wgDisableAnonymousEditing":false,"comscoreKeyword":"wikiacsid_games","quantcastLabels":",Genre.mystery,Genre.adventure,Genre.sim,Genre.puzzle,Genre.manga,Genre.thriller,Genre.shonen,Genre.seinen,Genre.anime,Genre.musical,Genre.drama,Genre.strategy,Genre.visual novel,Genre.comedy,Genre.crime,,,Media.movies,,Media.anime,Media.comics,Media.books,Theme.detective,Theme.police,Theme.heroes,Theme.japan,Theme.spy,,TV.funimation,TV.crunchyroll","wgCategorySelect":{"defaultNamespace":"Category","defaultNamespaces":"Category"},"wgEnableDiscussions":true,"viewTrackURL":
<link rel="stylesheet" href="/load.php?lang=en&modules=ext.fandom.ArticleInterlang.css%7Cext.fandom.CategoryPage.category-layout-selector.css%7Cext.fandom.CategoryPage.category-page3.css%7Cext.fandom.CreatePage.css%7Cext.fandom.Experiments.TRFC147%7Cext.fandom.FandomEmbedVideo.css%7Cext.fandom.GlobalComponents.CommunityHeader.css%7Cext.fandom.GlobalComponents.CommunityHeaderBackground.css%7Cext.fandom.GlobalComponents.GlobalComponentsTheme.light.css%7Cext.fandom.GlobalComponents.GlobalExploreNavigation.css%7Cext.fandom.GlobalComponents.GlobalFooter.css%7Cext.fandom.GlobalComponents.GlobalNavigation.css%7Cext.fandom.GlobalComponents.GlobalNavigationTheme.light.css%7Cext.fandom.GlobalComponents.RegistrationButtons.css%7Cext.fandom.GlobalComponents.StickyNavigation.css%7Cext.fandom.HighlightToAction.css%7Cext.fandom.Thumbnails.css%7Cext.fandom.ThumbnailsViewImage.css%7Cext.fandom.Uncrawlable.css%7Cext.fandom.bannerNotifications.desktop.css%7Cext.fandom.quickBar.css%7Cext.fandomVideo.css%7Cext.staffSig.css%7Cext.visualEditor.desktopArticleTarget.noscript%7Cskin.fandomdesktop.CargoTables-ext.css%7Cskin.fandomdesktop.Math.css%7Cskin.fandomdesktop.rail.css%7Cskin.fandomdesktop.rail.popularPages.css%7Cskin.fandomdesktop.styles%7Cvendor.tippy.css&only=styles&skin=fandomdesktop"/>
<script async="" src="/load.php?cb=20240220010657&lang=en&modules=startup&only=scripts&raw=1&skin=fandomdesktop"></script>
<meta name="ResourceLoaderDynamicStyles" content=""/>
<link rel="stylesheet" href="/load.php?lang=en&modules=site.styles&only=styles&skin=fandomdesktop"/>
<meta name="generator" content="MediaWiki 1.39.6"/>
<meta name="format-detection" content="telephone=no"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:site" content="@getfandom"/>
<meta name="twitter:url" content=""/>
<meta name="twitter:title" content="Category:Transcripts | Ace Attorney Wiki | Fandom"/>
<meta name="twitter:description" content="English language transcripts for Ace Attorney episodes."/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes, minimum-scale=0.25, maximum-scale=5.0"/>
<link href="/wikia.php?controller=ThemeApi&method=themeVariables" rel="stylesheet"/>
<link rel="alternate" type="application/x-wiki" title="Edit" href="/wiki/Category:Transcripts?action=edit"/>
<link rel="shortcut icon" href=""/>
<link rel="search" type="application/opensearchdescription+xml" href="/opensearch_desc.php" title="Ace Attorney Wiki (en)"/>
<link rel="EditURI" type="application/rsd+xml" href=""/>
<link rel="license" href=""/>
<link rel="canonical" href=""/>
<meta property="fb:app_id" content="112328095453510" prefix="fb:"/>
<script type="application/ld+json">{"@context":"","@type":"Article","url":"","name":"Transcripts","headline":"Transcripts","mainEntity":{"@type":"Thing","url":"","name":"Transcripts","image":""},"about":{"@type":"Thing","url":"","name":"Transcripts","image":""},"author":{"@type":"Organization","url":"","name":"Contributors to Ace Attorney Wiki"},"publisher":{"@type":"Organization","name":"Fandom, Inc.","logo":{"@type":"ImageObject","url":""}},"abstract":"English language transcripts for Ace Attorney episodes.","image":"","thumbnailUrl":""}</script>
<meta property="og:type" content="article"/>
<meta property="og:site_name" content="Ace Attorney Wiki"/>
<meta property="og:title" content="Transcripts"/>
<meta property="og:url" content=""/>
<meta property="og:image" content=""/>
const useMaxDefaultContentWidth = Boolean();
const defaultContentWidth = useMaxDefaultContentWidth ? 'expanded' : 'collapsed';
let contentWidthPreference;
try {
contentWidthPreference = localStorage.getItem('contentwidth') || defaultContentWidth;
} catch (e) {
contentWidthPreference = defaultContentWidth;
if ( contentWidthPreference === 'expanded' ) {

Not pretty... Spoiler alert: there are over 2000 lines of code that we don't care about before the links to the individual transcripts appear. Once we had pinned down exactly where the links we cared about appear (in this case, they're found in the div element with an id of "mw-parser-output"), I was able to download the contents of each transcript page using Python code, viewable here (or here, edited from a newtfire file):

import os

import bs4
import requests

archive_url = ""


def get_files():
    # Fetch the category page that lists every transcript.
    r = requests.get(archive_url)
    soup = bs4.BeautifulSoup(r.content, 'html.parser')

    # The transcript links live inside this div.
    div = soup.find('div', class_='category-page__members')
    links = div.find_all('a', href=True)

    base_url = ""

    for link in links:
        href = link['href']
        # The hrefs are relative, so prepend the site's base URL.
        absolute_url = base_url + href
        download_links(absolute_url)
    print("All transcripts downloaded!")


def download_links(href):
    # Name the local file after the last segment of the URL.
    file_name = href.split('/')[-1] + ".html"
    print("Downloading file: " + file_name)

    r = requests.get(href, stream=True)
    workingDir = os.getcwd()
    print("current working directory: " + workingDir)
    # Assumes a 'corpus' folder already exists in the working directory.
    fileDeposit = os.path.join(workingDir, 'corpus', file_name)

    # Write the response to disk in 1 MB chunks.
    with open(fileDeposit, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            if chunk:
                f.write(chunk)
    print("Downloaded " + file_name)


if __name__ == "__main__":
    get_files()

We knew from the outset that we wanted plain text files we could train a text-generation model on, and XML files we could analyze and glean data from. We did this in two stages: first we made our text corpus, then we created our XML corpus from the text files (creating XML files directly from the raw HTML transcript files proved to be far too complex).

Turning the HTML transcripts into text files was pretty simple. It mostly involved identifying the part of the page where the in-game text appeared (it was in the same section as the transcript links, the div element with the "mw-parser-output" id) and removing all the HTML remnants that didn't represent anything from the game. Importantly, some of the most iconic lines in the game, such as the 'Objection!'s and 'Take That!'s, weren't represented by any text; instead, they're represented by PNGs or GIFs, like the one seen to the right. To solve this, we carefully carved out each one of these special cut-in speech bubbles with precise XQuery commands run over the entire collection of HTML transcripts, keeping our future purge of the HTML remnants in mind. For a full record of the searches and replaces made in the process, you can read our markdown file that recorded them!

Objection speech bubble!
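To make the two steps above a bit more concrete, here's a minimal Python sketch: first swap each speech-bubble image for text, then strip the leftover markup. This is a stand-in for illustration, not the actual commands recorded in our markdown file. The sample HTML, the html_to_text name, and the assumption that each image carries its phrase in an alt attribute are all made up for the example.

```python
import re
from html import unescape

# Hypothetical pattern: assumes each speech-bubble image stores its phrase
# in the alt attribute. Real transcript markup may differ.
IMG_ALT = re.compile(r'<img\b[^>]*\balt="([^"]*)"[^>]*>', re.IGNORECASE)
TAG = re.compile(r'<[^>]+>')


def html_to_text(fragment):
    """Turn an HTML fragment into plain text, keeping image alt text."""
    # Step 1: replace every image with its alt text so lines like
    # "Objection!" survive the tag purge in step 2.
    with_bubbles = IMG_ALT.sub(lambda m: m.group(1), fragment)
    # Step 2: remove all remaining tags.
    no_tags = TAG.sub('', with_bubbles)
    # Step 3: decode entities such as &amp; and drop empty lines.
    lines = [unescape(line).strip() for line in no_tags.splitlines()]
    return '\n'.join(line for line in lines if line)


# Made-up sample, not copied from a real transcript page.
sample = '''<p><b>Phoenix:</b></p>
<p><img src="objection.png" alt="Objection!" width="120"/></p>
<p>That testimony contradicts the evidence &amp; you know it!</p>'''

print(html_to_text(sample))
```

The order matters here: if the tags were stripped first, the images (and the iconic lines they stand for) would vanish along with everything else.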

Turning these text files into XML files was a bit buggy, but by identifying how the scripts were structured, we could wrap almost every line in its own element that identified the speaker in an attribute (check out the markdown file cataloguing this here). The resulting files contained several bugs, but they were pretty simple to fix: we just had to go through the files one by one to see what errors were popping up, then fix them with XQuery (the markdown for this process is here). It's a good thing the errors were as uniform as the code!
Here's a snippet of the resulting XML code:

<line speaker="Gumshoe">Look, a ladder!</line>
<line speaker="Phoenix">That's a "step"-ladder.</line>
<line speaker="Gumshoe">What's the difference? Looks like a normal ladder to me, pal.</line>
<line speaker="Phoenix"><thought>Surely everyone knows the difference... I mean, they're pretty ordinary objects...</thought></line>
<line speaker="Gumshoe">I've met plenty of guys like you, always picking on the smallest details. The vegetable store guy near my place does it all the time. He even corrects me when I ask for a head of lettuce. "That's a cabbage," he says. I'm telling you, they're the exact same thing!</line>
<line speaker="Phoenix">No they're not! They're completely different!</line>
<line speaker="Gumshoe">You have to plant both of them firmly in the ground before they can grow, don't you? Listen. You gotta take a step back and look at the bigger picture sometimes. Otherwise you could miss a really important clue. That's advice from a pro, pal!</line>
<line speaker="Phoenix"><thought>...The last person I need advice from is this guy in front of me.</thought></line>
<line speaker="Phoenix">This must be an old pan handle or something.</line>
<line speaker="Gumshoe">H-How do you know that!?</line>
<line speaker="Phoenix">Huh, what?</line>
<line speaker="Gumshoe">That was my nickname in junior high.</line>
<line speaker="Phoenix">What, "pan handle"?</line>
<line speaker="Gumshoe">I didn't have much money back then, pal. I used to bum stuff off the other kids sometimes, so they called me "Panhandler".</line>
<line speaker="Phoenix">... "Panhandler", huh? I can see that.</line>
<line speaker="Gumshoe">Thinking back, it's actually kind of a nice memory now.</line>
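For a rough idea of how a text line can end up wrapped like the ones above, here's a small Python sketch. It's only an illustration, not our actual search-and-replace steps: it assumes each text line looks like "Speaker: dialogue" with thoughts written in parentheses, and the wrap_line name is hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed input shape: "Speaker: dialogue", with thoughts in parentheses.
# The real text files may be laid out differently.
LINE = re.compile(r'^([^:]+):\s*(.+)$')


def wrap_line(text_line):
    """Return a <line> element as a string, or None if the line doesn't fit."""
    match = LINE.match(text_line.strip())
    if match is None:
        return None
    speaker = match.group(1).strip()
    speech = match.group(2).strip()
    if speech.startswith('(') and speech.endswith(')'):
        # Inner monologue gets its own <thought> element.
        body = '<thought>' + escape(speech[1:-1]) + '</thought>'
    else:
        # escape() protects &, < and > so the output stays well-formed.
        body = escape(speech)
    return '<line speaker=' + quoteattr(speaker) + '>' + body + '</line>'


print(wrap_line('Gumshoe: Look, a ladder!'))
print(wrap_line('Phoenix: (Surely everyone knows the difference...)'))
```

Lines that don't match the pattern come back as None, which makes it easy to collect the leftovers and deal with them separately.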

The data we've scraped up isn't perfect: there are remnants of the HTML transcripts' walkthrough-like structure, as well as some lines with speakers that aren't real characters. Still, the resulting corpora are comprehensive and all well-formed, a mighty feat considering how buggy certain steps in the process were!
Here's a series of download links to all our data corpora, as well as the combined files containing all the data:
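If you grab the XML files, a quick tally of the speaker attribute is one way to count each character's lines (and to spot speakers that aren't real characters). Here's a minimal Python sketch. The transcript root element and the count_lines name are assumptions made for the example; only the line elements and their speaker attribute come from our corpus.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The <transcript> wrapper is hypothetical; the <line> elements are
# taken from the snippet shown on this page.
sample = '''<transcript>
<line speaker="Gumshoe">Look, a ladder!</line>
<line speaker="Phoenix">That's a "step"-ladder.</line>
<line speaker="Gumshoe">What's the difference? Looks like a normal ladder to me, pal.</line>
</transcript>'''


def count_lines(xml_text):
    """Count how many <line> elements each speaker has."""
    root = ET.fromstring(xml_text)
    return Counter(line.get('speaker') for line in root.iter('line'))


counts = count_lines(sample)
# Print speakers from most to fewest lines.
for speaker, total in counts.most_common():
    print(speaker, total)
```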
