Example: Loan Data

Intro

There are many potential applications of synthetic data in banking and finance where the nature of the data, being both personally and commercially sensitive, may rule out sharing real, identifiable data.

Here, we show how to use SqlSynthGen to generate a simple (uniformly random) synthetic version of the freely-available PKDD’99 dataset. This dataset contains 606 successful and 76 not successful loan applications.

Setup

The PKDD’99 dataset is stored on a MariaDB database, which means that we need a local MariaDB database to store the synthetic data. MariaDB installation instructions can be found here. We presume that you have a local server running on port 3306, with a user called myuser, a password mypassword and a database called financial.

$ mysql
MariaDB > create user 'myuser'@'localhost' identified by 'mypassword';
MariaDB > create database financial;
MariaDB > grant all privileges on financial.* to 'myuser'@'localhost';
MariaDB > \q

After installing SqlSynthGen, we create a .env file to set some environment variables to define the source database as the one linked at the bottom of the PKDD’99 page, and the destination database as the local one:

.env

SRC_DSN="mariadb+pymysql://guest:relational@relational.fit.cvut.cz:3306/Financial_ijs"
DST_DSN="mariadb+pymysql://myuser:mypassword@localhost:3306/financial"

Generating Data Uniformly at Random

We run SqlSynthGen’s make-tables command to create a file called orm.py that contains the schema of the source database.

$ sqlsynthgen make-tables

Inspecting the orm.py file, we see that the tkeys table has column called goodClient, which is a TINYINT. SqlSynthGen doesn’t know what to do with TINYINT columns, so we need to create a config file to tell it how to handle them. This isn’t necessary for normal Integer columns.

Looking at the goodClient values:

tkeys

id

goodClient

0

1

1

0

2

0

3

0

we see that they are always 0 or 1 so we will pick randomly from 0 and 1 for our synthetic value:

config.yaml

---
tables:
  tkeys:
    row_generators:
      - name: generic.random.choice
        columns_assigned: goodClient
        args:
          - [0, 1]

We run SqlSynthGen’s create-generators command to create df.py, which contains a generator class for each table in the source database:

$ sqlsynthgen create-generators --config config.yaml

We then run SqlSynthGen’s create-tables command to create the tables in the destination database:

$ sqlsynthgen create-tables

Note that, alternatively, you could use another tool, such as mysqldump to create the tables in the destination database.

Finally, we run SqlSynthGen’s create-data command to populate the tables with synthetic data:

$ sqlsynthgen create-data --num-passes 100

This will make 100 rows in each of the nine tables with uniformly random data.

Improving Data Generation with Vocabularies

We can do better than generating data randomly and uniformly, however. We notice that the districts table doesn’t contain any sensitive data so we choose to copy it whole to the destination database:

config.yaml

---
tables:
  tkeys:
    row_generators:
      - name: generic.random.choice
        columns_assigned: goodClient
        args:
          - [0, 1]
  districts:
    vocabulary_table: true

We can export the vocabularies to .yaml files, delete the old synthetic data, import the vocabularies and create new synthetic data with:

$ sqlsynthgen create-generators
$ sqlsynthgen remove-data
$ sqlsynthgen create-vocab
$ sqlsynthgen create-data --num-passes 100

This will give us an exact copy of the districts table:

districts

id

A2

A3

A4

A5

1

Hl.m. Praha

Prague

1204953

0

2

Benesov

central Bohemia

88884

80

3

Beroun

central Bohemia

75232

55

Adding a Foreign Key

We notice that the source database does not have a foreign key constraint between the clients.tkey_id column and the tkeys.id column, even though it looks like there ought to be one.

We add it manually to the orm.py file:

orm.py:

1class Clients(Base):
2    __tablename__ = 'clients'
3    __table_args__ = (
4        ForeignKeyConstraint(['district_id'], ['districts.id'], ondelete='CASCADE', onupdate='CASCADE', name='clients_ibfk_1'),
5        # Added manually
6        ForeignKeyConstraint(['tkey_id'], ['tkeys.id'], ondelete='CASCADE', onupdate='CASCADE', name='clients_tkey_id'),
7    )
8    ...

We’ll need to recreate the df.py file, the destination database and the data:

$ sqlsynthgen create-generators --config-file config.yaml --force
$ sqlsynthgen remove-tables --yes
$ sqlsynthgen create-tables
$ sqlsynthgen create-vocab
$ sqlsynthgen create-data --num-passes 100

We now have a FK relationship and all synthetic values of clients.tkey_id exist in the synthetic tkeys.id column.

Marginal Distributions with Differential Privacy

For many of the remaining categorical columns, such as cards.type, we may decide that we want to use the real values in the right proportions. Here is a sample of the cards table:

cards

id

disp_id

type

issued

1

9

gold

1998-10-16

2

19

classic

1998-03-13

3

41

gold

1995-09-03

We can take the real values in the right proportions, and even add noise to make them differentially private by using the source-statistics and SmartNoise SQL features:

config.yaml

---
src-stats:
  - name: count_card_types
    query: >
      SELECT
        c.type AS card_type,
        d.client_id
      FROM cards c
      JOIN disps d
        on c.disp_id = d.id
    dp-query: >
      SELECT
        count(*) AS the_count,
        card_type
      FROM query_result
      GROUP BY card_type
    epsilon: 0.1
    delta: 0.0001
    snsql-metadata:
      client_id:
        name: client_id
        type: int
        private_id: True
      card_type:
        name: card_type
        type: string
        private_id: False
row_generators_module: my_row_generators
tables:
  cards:
    row_generators:
      - name: my_row_generators.my_card_func
        kwargs:
          stats: SRC_STATS["count_card_types"]
        columns_assigned: type
  tkeys:
    row_generators:
      - name: generic.random.choice
        columns_assigned: goodClient
        args:
          - [0, 1]
  districts:
    vocabulary_table: true

We define a custom row-generator to use the source statistics and Python’s random.choices() function to choose a value:

my_row_generators.py

import random


def my_card_func(stats):
    """Choose a weighted card type."""
    total = sum([x["the_count"] for x in stats])
    return random.choices(
        population=tuple(x["card_type"] for x in stats),
        weights=tuple(x["the_count"] / total for x in stats),
    )[0]

As before, we will need to re-create df.py and the data.

$ sqlsynthgen create-generators --config-file config.yaml --force
$ sqlsynthgen make-stats --config-file config.yaml --force
$ sqlsynthgen remove-data --yes
$ sqlsynthgen create-vocab
$ sqlsynthgen create-data --num-passes 100

For further refinement, you can use “story generators” to create inter-table correlations so that, for example, the number of loan applications depends on the number of cards they have or the average amount of a bank transfer depends on the home city of a client. See the introduction for more.