
Why Data Scientists Should Use an ORM Like SQLAlchemy

In this post, I'm going to advocate for data scientists using an ORM like the one in SQLAlchemy. But I should start with a disclaimer about my experience, because the case I'm going to make isn't going to be relevant to everyone. My experience is at small startups, where I was responsible for working on a wide variety of tasks, and where I had to pull and clean data myself. If you work at a larger company, with a team of data engineers whose job it is to provide you with clean datasets, this post will be less relevant. Also, I came to data science from a background in statistics; if you have a compsci or a programming background, this may be old hat to you.

In this post, I'm assuming you're familiar with SQL, but I assume no familiarity with ORMs or SQLAlchemy. In fact, let's start with a quick intro to SQLAlchemy.

ORM: Object Relational Mapper

I like this definition from Active Record, a popular ORM for Ruby on Rails:

Object-Relational Mapping, commonly referred to as its abbreviation ORM, is a technique that connects the rich objects of an application to tables in a relational database management system. Using ORM, the properties and relationships of the objects in an application can be easily stored and retrieved from a database without writing SQL statements directly and with less overall database access code.

In short: an ORM lets you work with objects, obviating the need to write raw SQL.

SQLAlchemy contains a Python ORM

SQLAlchemy contains a Python ORM (object relational mapper) for SQL databases. I recommend looking at the tutorial, but here's the idea in a nutshell.
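First, a bit of setup. The snippets in this post assume an engine, a declarative base, and a session already exist. Here's a minimal sketch of what that might look like; the connection URL is a placeholder, and mssql is just a namespace holding the engine so that later snippets can say mssql.engine. (Later snippets also use MysqlBase and PsqlBase, which point at MySQL and PostgreSQL databases and are set up the same way.)

In [ ]:
import datetime
from collections import namedtuple

from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

# a namespace with an .engine attribute, matching the mssql.engine
# references below; the connection URL is a placeholder
Database = namedtuple('Database', ['engine'])
mssql = Database(engine=create_engine('mssql+pymssql://user:password@host/db'))

MssqlBase = declarative_base()
session = sessionmaker(bind=mssql.engine)()

With the session in hand, we can create classes that map to database tables: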

In [ ]:
class Member(MssqlBase):
    __tablename__ = "MemberFacts"
    __table_args__ = {'autoload': True, 'autoload_with': mssql.engine}

Here, the class Member maps to the table MemberFacts. 'autoload': True tells SQLAlchemy to load information about the table, so it 'knows' the columns in the table and any relationships with other tables. Here's an example of how I might use this class to query for a subset of members:

In [ ]:
members = session.query(Member). \
            filter(Member.created_at > datetime.datetime(2014, 1, 1)). \
            limit(10).all()

Even if you knew nothing about SQLAlchemy, you could probably get the gist of what's happening here: I'm querying for members created after 1/1/2014, and taking the first 10. Behind the scenes, SQLAlchemy will take this statement, translate it into raw SQL, run the SQL, and translate the results back into instances of the Member class. Each of these instances has the columns of the MemberFacts table as attributes, so if I wanted to create a pandas dataframe, I could do something like this:

In [ ]:
import pandas as pd

def data_frame(query, columns):
    """
    Takes a sqlalchemy query and a list of columns, returns a dataframe.
    """
    def make_row(x):
        return {c: getattr(x, c) for c in columns}
    return pd.DataFrame([make_row(x) for x in query])

query = session.query(Member).filter(Member.created_at > 
                                     datetime.datetime(2014, 1, 1)).limit(10)

# dataframe with just two fields
df = data_frame(query, ['id', 'created_at'])

# dataframe with all fields in the table
df = data_frame(query, [c.name for c in Member.__table__.columns])
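As an aside: if you're ever curious about the SQL a query will produce, you can render the query object as a string (the exact output depends on your database's dialect):

In [ ]:
# prints the SELECT statement SQLAlchemy will send, with
# placeholders where the bound parameters go
print(str(query))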

So if we're going to end up with a dataframe anyway, why take this detour through the ORM? To answer that question, we first have to get acquainted with SQLAlchemy's 'hybrid attributes'.

Hybrid Attributes

Using hybrid attributes, we can add attributes to our classes that aren't directly linked to columns in the underlying table. Here, I'm defining the attribute 'reference_count', the number of references a job candidate has linked to their profile:

In [ ]:
from sqlalchemy import func, select
from sqlalchemy.ext.hybrid import hybrid_property
from sqlalchemy.orm import relationship

class CandidateProfile(MysqlBase):
    __tablename__ = "candidate_profile"
    __table_args__ = {'autoload': True, 'autoload_with': mysql.engine}
    references = relationship(CandidateReference, backref='candidate_profile')
    
    @hybrid_property
    def reference_count(self):
        return len([_ for _ in self.references])

    @reference_count.expression
    def reference_count(cls):
        return select([func.count(CandidateReference.profile_id)]). \
            where(CandidateReference.profile_id == cls.id). \
            label('reference_count')

Note the two definitions of reference_count: the first defines the behavior at the instance level, and is 'pure' Python; the second defines behavior at the class level, and uses SQLAlchemy syntax. It's this double definition that makes these properties 'hybrid' properties. Using the first definition, if foo is an instance of CandidateProfile, I can use foo.reference_count to get the number of references linked to this profile. Using the second definition, I can query by reference count:

In [ ]:
candidate_profiles_with_3_references = session.query(CandidateProfile). \
                                       filter(CandidateProfile.reference_count == 3)

I've snuck in another concept here: note that this property relies on a relationship between CandidateProfile and CandidateReference. While I've told SQLAlchemy how I want to refer to this relationship (as the attribute references), I didn't have to explicitly describe how to make the join between the two tables because SQLAlchemy figured that out when it autoloaded information about the table from the database.
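To make the two levels concrete, here's one use of each (the threshold of 2 is arbitrary):

In [ ]:
# instance level: plain Python, evaluated after the row is loaded
profile = session.query(CandidateProfile).first()
print(profile.reference_count)

# class level: compiles into a correlated subquery in the WHERE clause
well_referenced = session.query(CandidateProfile). \
                      filter(CandidateProfile.reference_count >= 2).all()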

Now we're finally ready for the two main reasons why an ORM is valuable for data scientists:

  • hybrid properties can enforce consistent feature definitions, and
  • hybrid properties make complex feature definition easy.

Hybrid Properties Can Enforce Consistent Definitions Across a Team

Databases are messy. Maybe you have demo accounts in your database, created by people you work with. Maybe a change in the application means that accounts created before a certain date have special status and don't need to buy a subscription.

This creates a challenge for a data science team because not accounting for one of these countless quirks could derail an analysis. If (for example) you are building a predictive model of subscription, and you don't know that users with a cryptic value in an arcane column on the users table will never subscribe, you're going to have a bad time. My first thought when an analysis or model produces counter-intuitive results is to double- and triple-check the dataset, a process that can be time-consuming, especially when it involves looking at raw SQL queries.

Hybrid properties address this problem by making it possible to standardize the definitions of complex concepts like 'this user is capable of subscribing.' Because the definition is in one place, and in version control, the whole team is on the same page. Take a look at this actual example from a recent project, in which I'm defining whether a Member is_legit. (This is an admittedly vague term, and I should probably change it to is_real_member or is_not_test_account.) As you can see, there were four different ways in which a member could be not legit:

In [ ]:
from sqlalchemy import and_  # used in the class-level expression below

class Member(MssqlBase):
    __tablename__ = "MemberFacts"
    __table_args__ = {'autoload': True, 'autoload_with': mssql.engine}
    
    @hybrid_property
    def is_legit(self):
        return ((self.AffinionId not in wilbur.FAKE_AFFINIONIDS) and
                (self.date_created_dt <= apptime.now()) and
                (self.affinion_code.ReferrenceCode not in wilbur.IGNOREABLE_CODES) and
                (self.affinion_status.RenewalDate < datetime.datetime(2099, 1, 1)))

    @is_legit.expression
    def is_legit(cls):
        return and_(cls.AffinionId.notin_(wilbur.FAKE_AFFINIONIDS),
                    cls.date_created_dt <= apptime.now(),
                    cls.reference_code.notin_(wilbur.IGNOREABLE_CODES),
                    cls.renewal_date < datetime.datetime(2099, 1, 1))

In the above example, wilbur is the name of the module for the project, and I've defined concepts like FAKE_AFFINIONIDS and IGNOREABLE_CODES at the module level.
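For concreteness, here's roughly what those module-level definitions look like; the values below are taken from the raw SQL version of this query that appears later in the post:

In [ ]:
# wilbur.py (excerpt)
FAKE_AFFINIONIDS = ['AffinionDiagnostic', 'AF3']
IGNOREABLE_CODES = ['refCode1']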

Here's an example from a different project, in which I define a concept at the class level. Again, it illustrates how an ORM lets you define a concept once and use it across your analyses:

In [ ]:
class School(PsqlBase):
    __tablename__ = "schools"
    __table_args__ = {'autoload': True, 'autoload_with': psql.engine}
    
    ONLINE_SCHOOLS = [3402, 131, 22673, 4095, 3731]

    @hybrid_property
    def is_online(self):
        return self.id in self.ONLINE_SCHOOLS

    @is_online.expression
    def is_online(cls):
        return cls.id.in_(cls.ONLINE_SCHOOLS)
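Querying with it is then a one-liner:

In [ ]:
online_schools = session.query(School).filter(School.is_online).all()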

In both of these cases, you can imagine the value of having these concepts defined in a single place, with everyone on a team using the same definition: it will cut down on questions like, 'did you exclude this set of users?' and 'how did you define online schools?'

Now, the second reason an ORM is valuable for data scientists:

Hybrid Attributes Make Complex Feature Generation Easy

In my experience, data science is 70% data munging and 40% feature generation. Using an ORM can obviate the former and make the latter easier.

Let's say I'm working on a project and I want a dummy variable for whether or not a user bookmarked an art school because I think users interested in art schools are less likely to use certain paid features on our site. But because of the way our database is architected, making the connection between a user and the type of school she bookmarked requires three joins. Before I started using SQLAlchemy, there were three ways I might have approached this problem:

  1. Write a complicated query every time I want to use the dummy variable, with either a correlated subquery or a SUM(IF(...)) with a GROUP BY, being careful to keep my dataset at the user level
  2. Write a set of simple queries, pulling the various tables, and doing the munging client-side with pandas
  3. Work with a data architect/engineer to add this feature to the users table in the data warehouse

None of these solutions is ideal. Options 1 and 2 are fast, but they are filled with opportunities to make a mistake, and they create code that is unpleasant to read. Option 3 is best from an organizational perspective: if the feature is in the warehouse, everyone on the team can use it, and maintaining accurate data in the warehouse is the engineer/architect's specialty. If we end up using the feature a lot, I would recommend going with option 3. But if we aren't sure yet how much we'll use the feature, or if our data engineer is busy with more pressing problems, it would be nice to have another option.

That's where the ORM comes in. Because the ORM 'understands' the relationship between tables, it makes coding this three-join feature easy and readable. This example is from an actual project I worked on, and the code I used for the feature is below. The join is from User to UserSchool to School to SchoolIPEDS:

In [ ]:
class User(PsqlBase):
    __tablename__ = "users"
    __table_args__ = {'autoload': True, 'autoload_with': psql.engine}
    schools = relationship('UserSchool', backref='user')
    
    @hybrid_property
    def bookmarked_schools(self):
        return [s for s in self.schools if s.is_bookmarked]
    
    @hybrid_property
    def bookmarked_art_school(self):
        art_schools = [bs for bs in self.bookmarked_schools if 
                       bs.school.ipeds_data.is_art_school]
        return bool(art_schools)
    
class UserSchool(PsqlBase):
    __tablename__ = "user_schools"
    __table_args__ = {'autoload': True, 'autoload_with': psql.engine}
    school = relationship('School', backref='user_school')
    
class School(PsqlBase):
    __tablename__ = "schools"
    __table_args__ = {'autoload': True, 'autoload_with': psql.engine}
    ipeds_data = relationship("SchoolIPEDS", backref='school', uselist=False)
    
class SchoolIPEDS(PsqlBase):
    __tablename__ = "school_ipeds"
    __table_args__ = {'autoload': True, 'autoload_with': psql.engine}
    
    @hybrid_property
    def is_art_school(self):
        return self.ccbasic == 30

(Note that I haven't defined the class-level behavior for some of these properties - that's ok as long as I don't try to use them to filter queries.)
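With those mappings in place, turning the feature into a column of a dataset is short. Here's a sketch reusing the data_frame helper from earlier (the limit and the column list are arbitrary); at the instance level, hybrid properties are ordinary attributes, so they can go straight into the column list:

In [ ]:
query = session.query(User).limit(1000)
df = data_frame(query, ['id', 'bookmarked_art_school'])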

Secondary Benefits

These remaining benefits are nothing to sneeze at, but I'm calling them 'secondary' because they're less data-science centric.

Expressive and Readable Code

Two years ago, I was a complete rube when it came to code readability. I came from a background in statistics, I had only worked at tiny startups, and I had not worked closely with any developers. When I joined howaboutwe and the developers talked about 'expressive' code, I thought they meant 'expressive' as in 'expressive of self,' and I mentally labelled them hipsters. Sorry guys, I was wrong.

So in case you're where I was two years ago: the experience of reading code should be like the experience of reading prose. It shouldn't take more than a couple read-throughs for someone other than the writer to fully comprehend what the code does.

SQLAlchemy helps in this regard by turning this:

SELECT *
FROM MemberFacts MF
  LEFT JOIN AffinionCode AC ON MF.affinion_code = AC.code
  LEFT JOIN AffinionMembershipStatus AMS ON MF.rtid = AMS.rtid
WHERE MF.AffinionId NOT IN ('AffinionDiagnostic','AF3')
AND   MF.created_at < '2014-10-01'
AND   AC.ReferrenceCode NOT IN ('refCode1')
AND   AMS.RenewalDate < '2099-01-01';

into this:

members = session.query(Member).filter(Member.is_legit).all()


It's clear that the SQLAlchemy version is more pleasant to read and makes it easier to understand the coder's intent. Maintaining code readability may seem like an onerous chore, but you will thank yourself when you have to read your own code two months later. I've found that, despite intentions going in, very few analyses are actually one-offs: you will be going back to your code at some point in the future.

Unit Tests

Similar to code readability, unit tests are a staple of software development, but are a foreign concept to many data scientists. In short: using an ORM like SQLAlchemy makes it possible to write tests of your feature definitions, ensuring they are measuring what you want them to measure.
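For example, because the instance-level side of a hybrid property is plain Python, you can test it without touching the database. Here's a minimal sketch using unittest against the School.is_online property from earlier (3402 comes from the ONLINE_SCHOOLS list above; 99999 is a hypothetical id not in the list):

In [ ]:
import unittest

class TestIsOnline(unittest.TestCase):

    def test_online_school_is_flagged(self):
        school = School(id=3402)  # an id in ONLINE_SCHOOLS
        self.assertTrue(school.is_online)

    def test_other_school_is_not_flagged(self):
        school = School(id=99999)  # hypothetical id, not in the list
        self.assertFalse(school.is_online)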

Database Agnosticism

SQLAlchemy supports PostgreSQL, MySQL, Microsoft SQL Server, Oracle, and SQLite. Each of these databases uses a slightly different dialect of SQL. For example, MySQL supports the IF function, while at least two of the others (PostgreSQL and SQL Server) don't (they use a CASE expression instead). Should your company decide to switch backends, making a few changes in your SQLAlchemy mappings and hybrid property definitions will be much easier than combing through all your queries and making line edits. Many changes will be handled by SQLAlchemy itself: should your company make the curious decision to switch from MySQL to SQL Server, you wouldn't have to make any changes to this expression:

In [ ]:
members = session.query(Member).\
                  filter(Member.created_at > datetime.datetime(2014, 1, 1)). \
                  limit(10).all()

... even though SQL Server doesn't support LIMIT (it uses TOP instead). SQLAlchemy will make the correct translation for you, once you tell it about the change in backends.
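You can watch this translation happen by compiling the same query against different dialects (I'm aliasing the imports so they don't clash with the mssql namespace from the setup snippet):

In [ ]:
from sqlalchemy.dialects import mssql as mssql_dialect, mysql as mysql_dialect

query = session.query(Member).limit(10)

# renders with LIMIT
print(query.statement.compile(dialect=mysql_dialect.dialect()))

# renders with TOP
print(query.statement.compile(dialect=mssql_dialect.dialect()))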

Analogs in Looker

One last perk of getting familiar with data modeling in an ORM is that your hybrid properties can be translated into BI tools like the cool new kid on the block, Looker. For example, here's a property I've defined in SQLAlchemy:

In [ ]:
from sqlalchemy import Float, case, cast

class Subscription(MssqlBase):
    __tablename__ = "Subscriptions"
    __table_args__ = {'autoload': True, 'autoload_with': mssql.engine}
    
    @hybrid_property
    def had_free_trial(self):
        return 1 if self.trial_began_date else 0
    
    @had_free_trial.expression
    def had_free_trial(cls):
        return cast(case([(cls.trial_began_date != None, 1)], else_=0), Float)

And here's the corresponding property in LookML, Looker's modeling language:

- view: subscriptions
  sql_table_name: Subscriptions
  fields:

  - dimension: had_free_trial
    type: int
    sql: CAST(CASE WHEN ${TABLE}.trial_began_date is not null THEN 1 ELSE 0 END AS FLOAT)

Very similar, just using a different syntax. But the core concepts of data modeling are the same, so when you learn one, you're learning the other.


That's it for now. Hopefully, I've convinced you to consider using an ORM if you don't use one already. I may write a future post with tips for bootstrapping your way up the SQLAlchemy learning curve.