Example: Loan Data
Intro
There are many potential applications of synthetic data in banking and finance where the nature of the data, being both personally and commercially sensitive, may rule out sharing real, identifiable data.
Here, we show how to use SqlSynthGen to generate a simple (uniformly random) synthetic version of the freely-available PKDD’99 dataset. This dataset contains 606 successful and 76 not successful loan applications.
Setup
The PKDD’99 dataset is stored on a MariaDB database, which means that we need a local MariaDB database to store the synthetic data.
MariaDB installation instructions can be found here.
We presume that you have a local server running on port 3306, with a user called myuser, a password mypassword and a database called financial.
$ mysql
MariaDB > create user 'myuser'@'localhost' identified by 'mypassword';
MariaDB > create database financial;
MariaDB > grant all privileges on financial.* to 'myuser'@'localhost';
MariaDB > \q
After installing SqlSynthGen, we create a .env file to set some environment variables to define the source database as the one linked at the bottom of the PKDD’99 page, and the destination database as the local one:
.env
SRC_DSN="mariadb+pymysql://guest:relational@relational.fit.cvut.cz:3306/Financial_ijs"
DST_DSN="mariadb+pymysql://myuser:mypassword@localhost:3306/financial"
Generating Data Uniformly at Random
We run SqlSynthGen’s make-tables command to create a file called orm.py that contains the schema of the source database.
$ sqlsynthgen make-tables
Inspecting the orm.py file, we see that the tkeys table has column called goodClient, which is a TINYINT.
SqlSynthGen doesn’t know what to do with TINYINT columns, so we need to create a config file to tell it how to handle them.
This isn’t necessary for normal Integer columns.
Looking at the goodClient values:
id |
goodClient |
|---|---|
0 |
1 |
1 |
0 |
2 |
0 |
3 |
0 |
we see that they are always 0 or 1 so we will pick randomly from 0 and 1 for our synthetic value:
config.yaml
---
tables:
tkeys:
row_generators:
- name: generic.random.choice
columns_assigned: goodClient
args:
- [0, 1]
We run SqlSynthGen’s create-generators command to create df.py, which contains a generator class for each table in the source database:
$ sqlsynthgen create-generators --config config.yaml
We then run SqlSynthGen’s create-tables command to create the tables in the destination database:
$ sqlsynthgen create-tables
Note that, alternatively, you could use another tool, such as mysqldump to create the tables in the destination database.
Finally, we run SqlSynthGen’s create-data command to populate the tables with synthetic data:
$ sqlsynthgen create-data --num-passes 100
This will make 100 rows in each of the nine tables with uniformly random data.
Improving Data Generation with Vocabularies
We can do better than generating data randomly and uniformly, however.
We notice that the districts table doesn’t contain any sensitive data so we choose to copy it whole to the destination database:
config.yaml
---
tables:
tkeys:
row_generators:
- name: generic.random.choice
columns_assigned: goodClient
args:
- [0, 1]
districts:
vocabulary_table: true
We can export the vocabularies to .yaml files, delete the old synthetic data, import the vocabularies and create new synthetic data with:
$ sqlsynthgen create-generators
$ sqlsynthgen remove-data
$ sqlsynthgen create-vocab
$ sqlsynthgen create-data --num-passes 100
This will give us an exact copy of the districts table:
id |
A2 |
A3 |
A4 |
A5 |
|---|---|---|---|---|
1 |
Hl.m. Praha |
Prague |
1204953 |
0 |
2 |
Benesov |
central Bohemia |
88884 |
80 |
3 |
Beroun |
central Bohemia |
75232 |
55 |
Adding a Foreign Key
We notice that the source database does not have a foreign key constraint between the clients.tkey_id column and the tkeys.id column, even though it looks like there ought to be one.
We add it manually to the orm.py file:
orm.py:
1class Clients(Base):
2 __tablename__ = 'clients'
3 __table_args__ = (
4 ForeignKeyConstraint(['district_id'], ['districts.id'], ondelete='CASCADE', onupdate='CASCADE', name='clients_ibfk_1'),
5 # Added manually
6 ForeignKeyConstraint(['tkey_id'], ['tkeys.id'], ondelete='CASCADE', onupdate='CASCADE', name='clients_tkey_id'),
7 )
8 ...
We’ll need to recreate the df.py file, the destination database and the data:
$ sqlsynthgen create-generators --config-file config.yaml --force
$ sqlsynthgen remove-tables --yes
$ sqlsynthgen create-tables
$ sqlsynthgen create-vocab
$ sqlsynthgen create-data --num-passes 100
We now have a FK relationship and all synthetic values of clients.tkey_id exist in the synthetic tkeys.id column.
Marginal Distributions with Differential Privacy
For many of the remaining categorical columns, such as cards.type, we may decide that we want to use the real values in the right proportions.
Here is a sample of the cards table:
id |
disp_id |
type |
issued |
|---|---|---|---|
1 |
9 |
gold |
1998-10-16 |
2 |
19 |
classic |
1998-03-13 |
3 |
41 |
gold |
1995-09-03 |
We can take the real values in the right proportions, and even add noise to make them differentially private by using the source-statistics and SmartNoise SQL features:
config.yaml
---
src-stats:
- name: count_card_types
query: >
SELECT
c.type AS card_type,
d.client_id
FROM cards c
JOIN disps d
on c.disp_id = d.id
dp-query: >
SELECT
count(*) AS the_count,
card_type
FROM query_result
GROUP BY card_type
epsilon: 0.1
delta: 0.0001
snsql-metadata:
client_id:
name: client_id
type: int
private_id: True
card_type:
name: card_type
type: string
private_id: False
row_generators_module: my_row_generators
tables:
cards:
row_generators:
- name: my_row_generators.my_card_func
kwargs:
stats: SRC_STATS["count_card_types"]
columns_assigned: type
tkeys:
row_generators:
- name: generic.random.choice
columns_assigned: goodClient
args:
- [0, 1]
districts:
vocabulary_table: true
We define a custom row-generator to use the source statistics and Python’s random.choices() function to choose a value:
my_row_generators.py
import random
def my_card_func(stats):
"""Choose a weighted card type."""
total = sum([x["the_count"] for x in stats])
return random.choices(
population=tuple(x["card_type"] for x in stats),
weights=tuple(x["the_count"] / total for x in stats),
)[0]
As before, we will need to re-create df.py and the data.
$ sqlsynthgen create-generators --config-file config.yaml --force
$ sqlsynthgen make-stats --config-file config.yaml --force
$ sqlsynthgen remove-data --yes
$ sqlsynthgen create-vocab
$ sqlsynthgen create-data --num-passes 100
For further refinement, you can use “story generators” to create inter-table correlations so that, for example, the number of loan applications depends on the number of cards they have or the average amount of a bank transfer depends on the home city of a client. See the introduction for more.