Previously in this series of articles, we’ve discovered that the debugging symbols file PSX.SYM doesn’t match the PSX.EXE artifact we have on hand. Undaunted, we’ve created a placeholder program PSX.SYM.elf that matches the layout of PSX.SYM and then loaded the debug data on top of it inside Ghidra. In this part, we’ll complete the rescue of the debugging data by migrating it to PSX.EXE, a tangible executable.

Just version track it, right?

So, let’s fire up a Version Tracking session with PSX.SYM.elf as the source program and PSX.EXE as the destination program, click on the magic wand icon to run the correlators and…

…there are no matches.

We have a source program with no bytes but metadata and a destination program with bytes but no metadata. With no overlap between the two, Ghidra’s built-in correlators (which work with program bytes, references and symbol names) can’t match anything.

In theory, at this point the only thing we can do with an empty Version Tracking session is to create manual matches by hand, one by one. The source program contains thousands of individual pieces of informations, so that’s not a practical option.

Improving the dataset

First order of business is to improve the quality of the metadata inside PSX.EXE.elf. While loading PSX.SYM on top of it did bring a lot of information, there are quite a lot of blind spots left in there. The Psy-Q SDK does not provide any debugging information about its functions and data, but even the game code itself isn’t fully covered.

Before that, running Ghidra’s built-in CreateFunctionsFromSelection.java script on the .text section of PSX.EXE will discover and create the functions that Ghidra’s analyzers missed. After all, we can’t match functions that do not exist in the destination program.

Scavenging functions out of raw labels

We have about 450 functions declared inside PSX.SYM, but there are many more functions than that in this executable. Normally, the built-in CreateFunctionsFromSelection.java script can automatically create functions from dead subroutines, but our placeholder program has no instructions for Ghidra to detect subroutines, so we need another approach.

What we do have for everything are typeless symbols, raw addresses with nothing but names. If the .text section contains only functions and there are no overlaps or holes, then we can take a symbol and create a function that spans until the next one.

That can be scripted like so:

//Converts a range of symbols to functions, useful for naked symbols inside .text sections.
//@author Jean-Baptiste Boric
//@category Functions
//@keybinding
//@menupath
//@toolbar

import java.util.HashMap;
import java.util.Map;

import ghidra.app.script.GhidraScript;
import ghidra.app.services.*;
import ghidra.app.util.bin.*;
import ghidra.program.model.address.*;
import ghidra.program.model.block.*;
import ghidra.program.model.data.*;
import ghidra.program.model.data.ISF.*;
import ghidra.program.model.lang.*;
import ghidra.program.model.lang.protorules.*;
import ghidra.program.model.listing.*;
import ghidra.program.model.mem.*;
import ghidra.program.model.pcode.*;
import ghidra.program.model.reloc.*;
import ghidra.program.model.relocobj.*;
import ghidra.program.model.scalar.*;
import ghidra.program.model.symbol.*;
import ghidra.program.model.util.*;
import ghidra.util.exception.*;

public class SymbolsToFunctions extends GhidraScript {
	public AddressFactory addressFactory;
	public FunctionManager functionManager;
	public SymbolTable symbolTable;

	@Override
	public void run() throws Exception {
		addressFactory = currentProgram.getAddressFactory();
		functionManager = currentProgram.getFunctionManager();
		symbolTable = currentProgram.getSymbolTable();

		for (AddressRange selectionRange : currentSelection.getAddressRanges()) {
			Map<String, AddressSetView> symbolRanges = getSymbolRanges(selectionRange);

			for (Map.Entry<String, AddressSetView> symbolRange : symbolRanges.entrySet()) {
				String name = symbolRange.getKey();
				AddressSetView body = symbolRange.getValue();

				Function func = functionManager.getFunctionAt(body.getMinAddress());
				if (func == null) {
					writer.println(String.format("Creating function %s with body %s", name, body));
					functionManager.createFunction(name, body.getMinAddress(), body, SourceType.IMPORTED);
				} else {
					func.setName(name, SourceType.IMPORTED);
					func.setBody(body);
				}
			}
		}
	}

	private Map<String, AddressSetView> getSymbolRanges(AddressRange addressRange) {
		Map<String, AddressSetView> symbolRanges = new HashMap<>();

		String previousSymbolName = null;
		Address previousSymbolAddress = null;

		for (Symbol symbol : symbolTable.getPrimarySymbolIterator(addressFactory.getAddressSet(addressRange.getMinAddress(), addressRange.getMaxAddress()), true)) {
			String symbolName = symbol.getName();
			Address symbolAddress = symbol.getAddress();

			if (previousSymbolAddress != null) {
				symbolRanges.put(previousSymbolName, addressFactory.getAddressSet(previousSymbolAddress, symbolAddress.previous()));
			}

			previousSymbolName = symbolName;
			previousSymbolAddress = symbolAddress;
		}

		if (previousSymbolAddress != null) {
			symbolRanges.put(previousSymbolName, addressFactory.getAddressSet(previousSymbolAddress, addressRange.getMaxAddress()));
		}

		return symbolRanges;
	}
}

A fair amount of Psy-Q SDK functions like CdSearchFile use a bunch of private static functions that follow the public function but do not appear in the symbol table. Since we don’t have the data to reconstruct this, this script will create a big CdSearchFile function instead of one small CdSearchFile function and a bunch of follow-up static functions.

Running this script on a small selection of addresses for testing yields the following output:

SymbolsToFunctions.java> Running...
Creating function cd_tell with body [[8008ae30, 8008ae67] ]
Creating function cd_seek with body [[8008ad8c, 8008ae2f] ]
Creating function cd_read with body [[8008aea0, 8008b02b] ]
Creating function cd_getsize with body [[8008ae68, 8008ae9f] ]
SymbolsToFunctions.java> Finished!

In total, over 640 functions were recovered in this manner. Sure, they don’t have signatures, they don’t have bytes and some of them don’t even have the right size, but we can fix at least the first problem.

We have the means to make you signed

Just because the SDK doesn’t provide debugging symbols doesn’t mean we can’t recover information through different means. The Psy-Q headers provide symbol names and type information, at least for the public interfaces. The PlayStation executable loader includes data type archives processed from the SDK header files, which we can apply using this script:

//Applies function signatures held inside data type archives.
//@author Jean-Baptiste Boric
//@category Functions
//@keybinding
//@menupath
//@toolbar

import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import ghidra.app.script.GhidraScript;
import ghidra.app.services.*;
import ghidra.app.util.bin.*;
import ghidra.program.model.address.*;
import ghidra.program.model.block.*;
import ghidra.program.model.data.*;
import ghidra.program.model.data.ISF.*;
import ghidra.program.model.lang.*;
import ghidra.program.model.lang.protorules.*;
import ghidra.program.model.listing.*;
import ghidra.program.model.mem.*;
import ghidra.program.model.pcode.*;
import ghidra.program.model.reloc.*;
import ghidra.program.model.relocobj.*;
import ghidra.program.model.scalar.*;
import ghidra.program.model.symbol.*;
import ghidra.program.model.util.*;
import ghidra.util.exception.*;

public class ApplyFunctionSignaturesFromDataTypeArchive extends GhidraScript {
	public static final Pattern PATTERN = Pattern.compile(".+/functions/.+");

	public AddressFactory addressFactory;
	public FunctionManager functionManager;
	public SymbolTable symbolTable;

	@Override
	public void run() throws Exception {
		addressFactory = currentProgram.getAddressFactory();
		functionManager = currentProgram.getFunctionManager();
		symbolTable = currentProgram.getSymbolTable();

		DataTypeManager dtm = getDataTypeManager("psyq470");
		Map<String, FunctionSignature> signatures = getFunctionSignatures(dtm, PATTERN);
		List<Function> functions = new ArrayList<>();
		functionManager.getFunctions(currentSelection, true).forEachRemaining(functions::add);
		applyFunctionSignatures(functions, signatures);
	}

	private void applyFunctionSignatures(List<Function> functions, Map<String, FunctionSignature> signatures) throws Exception {
		for (Function function : functions) {
			FunctionSignature signature = signatures.get(function.getName());
			if (signature == null || signature.isEquivalentSignature(function.getSignature())) {
				continue;
			}

			function.setReturnType(signature.getReturnType(), SourceType.IMPORTED);
			List<ParameterImpl> parameters = Arrays.asList(signature.getArguments()).stream().map(parameter -> {
				try {
					return new ParameterImpl(parameter.getName(), parameter.getDataType(), currentProgram);
				} catch (Exception ex) {
					throw new RuntimeException(ex);
				}
			}).toList();
			function.replaceParameters(parameters, Function.FunctionUpdateType.DYNAMIC_STORAGE_ALL_PARAMS, true, SourceType.IMPORTED);
			function.setVarArgs(signature.hasVarArgs());

			writer.println(String.format("%s> Processed function %s", function.getEntryPoint(), function.getName()));
		}
	}

	private DataTypeManager getDataTypeManager(String name) {
		DataTypeArchiveService dataTypeArchiveService = state.getTool().getService(DataTypeArchiveService.class);
		for (DataTypeManager dtm : Arrays.asList(dataTypeArchiveService.getDataTypeManagers())) {
			if (dtm.getName().equals(name)) {
				return dtm;
			}
		}

		throw new RuntimeException("Couldn't find data type archive " + name);
	}

	private Map<String, FunctionSignature> getFunctionSignatures(DataTypeManager dtm, Pattern pattern) {
		Map<String, FunctionSignature> signatures = new HashMap<>();
		List<DataType> dataTypes = new ArrayList<>();
		dtm.getAllDataTypes(dataTypes);

		for (DataType dataType : dataTypes) {
			if (!(dataType instanceof FunctionSignature)) {
				continue;
			}

			FunctionSignature signature = (FunctionSignature) dataType;
			if (pattern.matcher(dataType.getDataTypePath().toString()).matches()) {
				signatures.put(signature.getName(), signature);
			}
		}

		return signatures;
	}
}

This script matches the names of functions to the signatures held inside data type archives and applies the signature to the function if a match is found. We recover over 350 untyped function signatures this way, from the functions we’ve reconstructed from raw labels.

Looking for correlations

Now that we have cleaned up our placeholder executable as much as possible, it’s time to Version Track it for all its worth. Since the built-in Ghidra correlators can’t match anything, we need to think outside the box. Luckily for us, there are some patterns in the metadata that we can take advantage of.

Assuming the source and destination programs are reasonably close, it is likely that they share global variables and functions with identical sizes, at least to some extent. What is also likely is that they share the same ordering of these global variables and functions, especially within the same original object file. Therefore, if we were to identify an identical pair of spots between the two programs, it’s likely that the data before and after it is also identical.

So we need to identify runs of similarly-sized stuff between our two programs in order to correlate them. I could fork Ghidra and implement new types of correlators cleanly, or I could also just lazily script my way out, which sounds far more expedient for an experimental approach.

Shaping up correlations

Functions are fairly obvious and sizable, so if their sizes have a lot of variation then the patterns should stand out. Since we don’t really know where we’re going with this stuff, we can start with a prototype that merely outputs the best match found:

//Correlates the shapes of functions
//@author Jean-Baptiste Boric
//@category Version Tracking
//@keybinding
//@menupath
//@toolbar

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import ghidra.app.services.*;
import ghidra.app.util.bin.*;
import ghidra.feature.vt.GhidraVersionTrackingScript;
import ghidra.program.model.address.*;
import ghidra.program.model.block.*;
import ghidra.program.model.data.*;
import ghidra.program.model.data.ISF.*;
import ghidra.program.model.lang.*;
import ghidra.program.model.lang.protorules.*;
import ghidra.program.model.listing.*;
import ghidra.program.model.mem.*;
import ghidra.program.model.pcode.*;
import ghidra.program.model.reloc.*;
import ghidra.program.model.relocobj.*;
import ghidra.program.model.scalar.*;
import ghidra.program.model.symbol.*;
import ghidra.program.model.util.*;
import ghidra.util.exception.*;

public class MultipleFunctionShapeCorrelator extends GhidraVersionTrackingScript {
	public static final int CONVOLUTION_RANGE = 7;

	@Override
	public void run() throws Exception {
		openVersionTrackingSession("/VT_PSX.SYM.elf_PSX.EXE");
		matchFunctions(getFunctionsOf(sourceProgram), getFunctionsOf(destinationProgram));
	}

	public List<Function> getFunctionsOf(Program program) {
		List<Function> functions = new ArrayList<>();
		FunctionManager functionManager = program.getFunctionManager();
		functionManager.getFunctions(true).forEachRemaining(functions::add);
		return functions;
	}

	public List<Integer> getFunctionSizes(List<Function> functions) {
		return functions.stream().map(f -> (int)f.getBody().getNumAddresses()).collect(Collectors.toList());
	}

	public void matchFunctions(List<Function> sourceFunctions, List<Function> destinationFunctions) {
		List<Integer> sourceSizes = getFunctionSizes(sourceFunctions);
		List<Integer> destinationSizes = getFunctionSizes(destinationFunctions);

		for (int i = 0; i < sourceSizes.size(); i++) {
			Function sourceFunction = sourceFunctions.get(i);
			int bestMatch = 0;
			float bestScore = 0;
			float secondBestScore = -1;

			for (int j = 0; j < destinationSizes.size(); j++) {
				float score = 1;
				for (int k = -CONVOLUTION_RANGE; k <= CONVOLUTION_RANGE; k++) {
					int a = sourceSizes.get(Math.floorMod(i+k, sourceSizes.size()));
					int b = destinationSizes.get(Math.floorMod(j+k, destinationSizes.size()));
					score *= 1 - (Math.abs(a - b) / (float)(a + b)) * Math.exp(-Math.abs(k));
				}

				if (score > bestScore) {
					secondBestScore = bestScore;
					bestScore = score;
					bestMatch = j;
				}
			}

			Function destinationFunction = destinationFunctions.get(bestMatch);
			float confidence = (bestScore - secondBestScore) / bestScore;
			writer.println(String.format("[%30s; %30s] best score %1.3f second best %1.3f confidence %1.3f", sourceFunction, destinationFunction, bestScore, secondBestScore, confidence));
		}
	}
}

The script treats the function sizes of both programs as vectors, then tries to find the best pair match by scoring between 0 and 1 how closely the two functions and their neighbors fit together. If we have a score of 1, then the match is perfect: both the functions and their neighbors have the same size. We also compute a confidence score based on the best and second-best scores, the wider the gap between the two the better.

There’s probably an elegant way to modelize this properly and formally, with fancy-pants words like cross-correlation, Gaussian or whatever. Alas, I’m an engineer, not a mathematician, so I’m just winging the scoring with a made-up formula that looks pretty.

As long as the correct match often has the best score, I don’t care if I’m not leveraging this data set to the fullest extent possible.

The script outputs this kind of data on the console:

Source function Destination function Score Confidence
CreateHumanoid FUN_800247fc 0.770 0.107
KillAllHumanoid FUN_800249b8 0.913 0.249
KillHumanoid FUN_80024a00 0.968 0.365
ControlAllHumanoid FUN_80024af4 0.988 0.297
ControlHumanoid FUN_80024b68 0.996 0.387
DefaultActionHumanoid FUN_80024de0 0.998 0.555
GetHumanoid FUN_800256f0 1.000 0.457
MoveHumanoid FUN_80025764 1.000 0.291

Given the lack of mathematical soundness of the formula, we can’t conclude much just by looking at the raw score and confidence numbers. However, we can take a peek at the destination functions and judge if the source function is a match. For this subset, after manual inspection every pair turns out to be correct, so it seems like this empirical method has promising results.

Poor’s man matches

We have the data that pairs functions from the source program to the destination program, but writing text on the output stream doesn’t put it inside Ghidra’s database. Luckily, we can hijack the manual match set present inside any Version Tracking session for our needs.

It’s a bit tricky to add a match in an idempotent manner, to ensure that running the script multiple times doesn’t insert duplicate matches, but not impossible:

private void addMatch(AddressSetView source, AddressSetView destination, float similarity, float confidence, VTAssociationType associationType) {
    VTMatchSet manualMatchSet = vtSession.getManualMatchSet();
    VTMatchInfo matchInfo = new VTMatchInfo(manualMatchSet);
    matchInfo.setSourceAddress(source.getMinAddress());
    matchInfo.setDestinationAddress(destination.getMinAddress());
    matchInfo.setSourceLength((int) source.getNumAddresses());
    matchInfo.setDestinationLength((int) destination.getNumAddresses());
    matchInfo.setSimilarityScore(new VTScore(similarity));
    matchInfo.setConfidenceScore(new VTScore(Math.pow(10, confidence)));
    matchInfo.setAssociationType(associationType);
    matchInfo.setTag(null);

    if (!manualMatchSet.getMatches().stream().anyMatch(match -> 
        match.getSourceAddress().equals(matchInfo.getSourceAddress()) &&
        match.getDestinationAddress().equals(matchInfo.getDestinationAddress()) &&
        (match.getSourceLength() == matchInfo.getSourceLength()) &&
        (match.getDestinationLength() == matchInfo.getDestinationLength()) &&
        SystemUtilities.isEqual(match.getSimilarityScore(), matchInfo.getSimilarityScore()) &&
        SystemUtilities.isEqual(match.getConfidenceScore(), matchInfo.getConfidenceScore()) &&
        SystemUtilities.isEqual(match.getAssociation().getType(), matchInfo.getAssociationType())
    )) {
        manualMatchSet.addMatch(matchInfo);
    }
}

After replacing the console logging with a call to this method, we get finally what we were after: a Version Tracking session with a whole bunch of matches.

Not bad for a source program that has no bytes.

What about data?

In theory, the same operation should be doable with data, but I do not think this is a viable option for now. The function correlator is already having quite a lot of trouble matching objects that are quite distinctive: simply put, I think there’s too little overlapping information between the source and destination program to make it work.

That being said, it’s still possible to match up runs of identical data between PSX.SYM.elf and PSX.EXE by hand, so the debugging information can still be leveraged there. Just make a selection in the source and destination programs, invoke the Create Manual Match From Tool action and…

…oh come on, I’m not even doing something that’s an offense to Cthulhu right now, I’m just trying to create a manual match with data!

I think this wasn’t implemented before because the built-in correlators probably do a good enough job with bytes and references to not require the creation of manual matches. Too bad my source program doesn’t have bytes or references.

I almost decided to modify Ghidra’s manual match action in anger, but relented after remembering my previous experience maintaining my own Ghidra fork. Instead, I’ll script this one out and maybe worry about making a proper pull request to Ghidra’s source code later:

//Create a manual match for data
//@author Jean-Baptiste Boric
//@category Version Tracking
//@keybinding
//@menupath
//@toolbar

import java.lang.reflect.*;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import ghidra.app.services.*;
import ghidra.app.util.bin.*;
import ghidra.feature.vt.api.main.*;
import ghidra.feature.vt.gui.plugin.*;
import ghidra.feature.vt.GhidraVersionTrackingScript;
import ghidra.framework.plugintool.*;
import ghidra.program.model.address.*;
import ghidra.program.model.block.*;
import ghidra.program.model.data.*;
import ghidra.program.model.data.ISF.*;
import ghidra.program.model.lang.*;
import ghidra.program.model.lang.protorules.*;
import ghidra.program.model.listing.*;
import ghidra.program.model.mem.*;
import ghidra.program.model.pcode.*;
import ghidra.program.model.reloc.*;
import ghidra.program.model.relocobj.*;
import ghidra.program.model.scalar.*;
import ghidra.program.model.symbol.*;
import ghidra.program.model.util.*;
import ghidra.program.util.*;
import ghidra.util.*;
import ghidra.util.exception.*;

public class CreateManualMatchData extends GhidraVersionTrackingScript {
	@Override
	public void run() throws Exception {
		openVersionTrackingSession("/Miscellanea/VT_PSX.SYM.elf_PSX.EXE (manual matches)");
		PluginTool manager = state.getTool();
		Plugin subordinate = manager.getManagedPlugins().stream().filter(p -> p.getClass().getSimpleName().equals("VersionTrackingSubordinatePluginX")).findFirst().get();
		Field subToolManagerField = subordinate.getClass().getDeclaredField("this$0");
		subToolManagerField.setAccessible(true);
		VTSubToolManager subToolManager = (VTSubToolManager) subToolManagerField.get(subordinate);
		Method sourceMethod = subToolManager.getClass().getDeclaredMethod("getSourceLocation");
		Method destinationMethod = subToolManager.getClass().getDeclaredMethod("getDestinationLocation");
		sourceMethod.setAccessible(true);
		destinationMethod.setAccessible(true);
		ProgramLocation sourceLocation = (ProgramLocation) sourceMethod.invoke(subToolManager);
		ProgramLocation destinationLocation = (ProgramLocation) destinationMethod.invoke(subToolManager);
		CodeUnit sourceCodeUnit = sourceProgram.getListing().getCodeUnitContaining(sourceLocation.getAddress());
		CodeUnit destinationCodeUnit = destinationProgram.getListing().getCodeUnitContaining(destinationLocation.getAddress());
		AddressSetView source = sourceProgram.getAddressFactory().getAddressSet(sourceCodeUnit.getMinAddress(), sourceCodeUnit.getMaxAddress());
		AddressSetView destination = destinationProgram.getAddressFactory().getAddressSet(destinationCodeUnit.getMinAddress(), destinationCodeUnit.getMinAddress().add(source.getNumAddresses() - 1));

		if (sourceLocation == null || destinationLocation == null) {
			Msg.showInfo(getClass(), null, "Cannot Create Match", "There must be a source and destination selection");
			return;
		}

		addMatch(source, destination, 1.0f, 0.0f, VTAssociationType.DATA);
	}

	private void addMatch(AddressSetView source, AddressSetView destination, float similarity, float confidence, VTAssociationType associationType) {
		VTMatchSet manualMatchSet = vtSession.getManualMatchSet();
		VTMatchInfo matchInfo = new VTMatchInfo(manualMatchSet);
		matchInfo.setSourceAddress(source.getMinAddress());
		matchInfo.setDestinationAddress(destination.getMinAddress());
		matchInfo.setSourceLength((int) source.getNumAddresses());
		matchInfo.setDestinationLength((int) destination.getNumAddresses());
		matchInfo.setSimilarityScore(new VTScore(similarity));
		matchInfo.setConfidenceScore(new VTScore(Math.pow(10, confidence)));
		matchInfo.setAssociationType(associationType);
		matchInfo.setTag(null);

		if (!manualMatchSet.getMatches().stream().anyMatch(match -> 
			match.getSourceAddress().equals(matchInfo.getSourceAddress()) &&
			match.getDestinationAddress().equals(matchInfo.getDestinationAddress()) &&
			(match.getSourceLength() == matchInfo.getSourceLength()) &&
			(match.getDestinationLength() == matchInfo.getDestinationLength()) &&
			SystemUtilities.isEqual(match.getAssociation().getType(), matchInfo.getAssociationType())
		)) {
			manualMatchSet.addMatch(matchInfo);
		}
	}
}

By the sheer amount of Java reflection going on in this script, you can probably tell I was getting rather pissed off trying to accomplish this task. I’m scripting the crap out of Ghidra here, that’s not how you’re supposed to do it.

Anyways, this script works similarly to the Create Manual Match From Tool action. Click a spot on the source and destination programs, it will create a match from the source code unit to an equally-sized region on the destination.

End result

The home-built function correlator worked very well for game code covered by debugging symbols, but it struggles in the following situations:

  • Missing static functions in the source dataset, which throws off the sequence of function sizes to compare ;
  • Long runs of identically-sized functions, which doesn’t provide any entropy within the correlation window.

Creating manual matches for the misidentified functions and the bulk of the data sections took a bit of time, but at the end the destination program is fairly thoroughly annotated, at least for the game and public SDK parts. The fact that the source and destination programs were very close made the process reasonably straightforward, but some of the matches did require a lot of inferring to figure out.

Had I attempted to version track directly onto a more distantly related artifact, it is likely that the source program being a placeholder with no initialized bytes or references would’ve made identifying matches extremely difficult.

Conclusion

Despite the built-in Ghidra correlators being unable to work with PSX.SYM.elf, our placeholder program created in the last part, we’ve managed to make matches inside Version Tracking against PSX.EXE through out-of-the-box thinking and generous (ab)use of scripting, rescuing the valuable debugging data from a lost artifact onto a real program.