Example: Loan Data

Intro

There are many potential applications of synthetic data in banking and finance where the nature of the data, being both personally and commercially sensitive, may rule out sharing real, identifiable data.

Here, we show how to use SqlSynthGen to generate a simple (uniformly random) synthetic version of the freely-available PKDD’99 dataset. This dataset contains 606 successful and 76 not successful loan applications.

Setup

The PKDD’99 dataset is stored on a MariaDB database, which means that we need a local MariaDB database to store the synthetic data. MariaDB installation instructions can be found here. We presume that you have a local server running on port 3306, with a user called myuser, a password mypassword and a database called financial.

$ mysql
MariaDB > create user 'myuser'@'localhost' identified by 'mypassword';
MariaDB > create database financial;
MariaDB > grant all privileges on financial.* to 'myuser'@'localhost';
MariaDB > \q

After installing SqlSynthGen, we create a .env file to set some environment variables to define the source database as the one linked at the bottom of the PKDD’99 page, and the destination database as the local one:

.env

SRC_DSN="mariadb+pymysql://guest:relational@relational.fit.cvut.cz:3306/Financial_ijs"
DST_DSN="mariadb+pymysql://myuser:mypassword@localhost:3306/financial"

Generating Data Uniformly at Random

We run SqlSynthGen’s make-tables command to create a file called orm.py that contains the schema of the source database.

$ sqlsynthgen make-tables

Inspecting the orm.py file, we see that the tkeys table has column called goodClient, which is a TINYINT. SqlSynthGen doesn’t know what to do with TINYINT columns, so we need to create a config file to tell it how to handle them. This isn’t necessary for normal Integer columns.

Looking at the goodClient values:

tkeys
id	goodClient
0	1
1	0
2	0
3	0

we see that they are always 0 or 1 so we will pick randomly from 0 and 1 for our synthetic value:

config.yaml

---
tables:
  tkeys:
    row_generators:
      - name: generic.random.choice
        columns_assigned: goodClient
        args:
          - [0, 1]

We run SqlSynthGen’s create-generators command to create df.py, which contains a generator class for each table in the source database:

$ sqlsynthgen create-generators --config config.yaml

We then run SqlSynthGen’s create-tables command to create the tables in the destination database:

$ sqlsynthgen create-tables

Note that, alternatively, you could use another tool, such as mysqldump to create the tables in the destination database.

Finally, we run SqlSynthGen’s create-data command to populate the tables with synthetic data:

$ sqlsynthgen create-data --num-passes 100

This will make 100 rows in each of the nine tables with uniformly random data.

Improving Data Generation with Vocabularies

We can do better than generating data randomly and uniformly, however. We notice that the districts table doesn’t contain any sensitive data so we choose to copy it whole to the destination database:

config.yaml

---
tables:
  tkeys:
    row_generators:
      - name: generic.random.choice
        columns_assigned: goodClient
        args:
          - [0, 1]
  districts:
    vocabulary_table: true

We can export the vocabularies to .yaml files, delete the old synthetic data, import the vocabularies and create new synthetic data with:

$ sqlsynthgen create-generators
$ sqlsynthgen remove-data
$ sqlsynthgen create-vocab
$ sqlsynthgen create-data --num-passes 100

This will give us an exact copy of the districts table:

districts
id	A2	A3	A4	A5
1	Hl.m. Praha	Prague	1204953	0
2	Benesov	central Bohemia	88884	80
3	Beroun	central Bohemia	75232	55

Adding a Foreign Key

We notice that the source database does not have a foreign key constraint between the clients.tkey_id column and the tkeys.id column, even though it looks like there ought to be one.

We add it manually to the orm.py file:

orm.py:

class Clients(Base):
    __tablename__ = 'clients'
    __table_args__ = (
        ForeignKeyConstraint(['district_id'], ['districts.id'], ondelete='CASCADE', onupdate='CASCADE', name='clients_ibfk_1'),
        # Added manually
        ForeignKeyConstraint(['tkey_id'], ['tkeys.id'], ondelete='CASCADE', onupdate='CASCADE', name='clients_tkey_id'),
    )
    ...

We’ll need to recreate the df.py file, the destination database and the data:

$ sqlsynthgen create-generators --config-file config.yaml --force
$ sqlsynthgen remove-tables --yes
$ sqlsynthgen create-tables
$ sqlsynthgen create-vocab
$ sqlsynthgen create-data --num-passes 100

We now have a FK relationship and all synthetic values of clients.tkey_id exist in the synthetic tkeys.id column.

Marginal Distributions with Differential Privacy

For many of the remaining categorical columns, such as cards.type, we may decide that we want to use the real values in the right proportions. Here is a sample of the cards table:

cards
id	disp_id	type	issued
1	9	gold	1998-10-16
2	19	classic	1998-03-13
3	41	gold	1995-09-03

We can take the real values in the right proportions, and even add noise to make them differentially private by using the source-statistics and SmartNoise SQL features:

config.yaml

---
src-stats:
  - name: count_card_types
    query: >
      SELECT
        c.type AS card_type,
        d.client_id
      FROM cards c
      JOIN disps d
        on c.disp_id = d.id
    dp-query: >
      SELECT
        count(*) AS the_count,
        card_type
      FROM query_result
      GROUP BY card_type
    epsilon: 0.1
    delta: 0.0001
    snsql-metadata:
      client_id:
        name: client_id
        type: int
        private_id: True
      card_type:
        name: card_type
        type: string
        private_id: False
row_generators_module: my_row_generators
tables:
  cards:
    row_generators:
      - name: my_row_generators.my_card_func
        kwargs:
          stats: SRC_STATS["count_card_types"]
        columns_assigned: type
  tkeys:
    row_generators:
      - name: generic.random.choice
        columns_assigned: goodClient
        args:
          - [0, 1]
  districts:
    vocabulary_table: true

We define a custom row-generator to use the source statistics and Python’s random.choices() function to choose a value:

my_row_generators.py

import random


def my_card_func(stats):
    """Choose a weighted card type."""
    total = sum([x["the_count"] for x in stats])
    return random.choices(
        population=tuple(x["card_type"] for x in stats),
        weights=tuple(x["the_count"] / total for x in stats),
    )[0]

As before, we will need to re-create df.py and the data.

$ sqlsynthgen create-generators --config-file config.yaml --force
$ sqlsynthgen make-stats --config-file config.yaml --force
$ sqlsynthgen remove-data --yes
$ sqlsynthgen create-vocab
$ sqlsynthgen create-data --num-passes 100

For further refinement, you can use “story generators” to create inter-table correlations so that, for example, the number of loan applications depends on the number of cards they have or the average amount of a bank transfer depends on the home city of a client. See the introduction for more.