Press "Enter" to skip to content

Category: Python

Predicting When Your SQL Server Will Run Out Of Space via Colab, Plotly, and Pandas

No, not Azure Data Studio or Databricks notebooks (yet) – I wanted to give Google Colab a spin and the big G hosted Jupyter notebook has an expansive free tier with more RAM than my work computer and a graphics card attached to boot!

Going forward with my explorations I wanted to make use of an open-ish dataset, so I used Stack Overflow via Brent Ozar’s blog post on getting the data

Keep in mind its almost 500GB (zip + extract) – and while in an ideal situation we wouldn’t be torrenting a database, extracting, attaching, querying, etc to get at our insights I’ve honestly seen slower iterations at many enterprises so it wasn’t a big deal for me.

If you’ve never used Stack’s database its worth mentioning you an also avoid this all by running queries on their own servers at the Stack Exchange Data Explorer. There’s more tables there too!

If you just want the notebook code click here!

An SSMS object explorer view of the tables in Stackoverflow with only 12 tables
Turns out you can run a global website on 12 tables?

If you haven’t touched Stack’s database before you might be surprised how svelte it is, the backup I am working with only has 12 tables. That’s right, just 12.

The truth is that you don’t need to make things complex to make a useful product, and of course Stack/Brent have removed some ancillary and internal things which is fine by me.

The first question I wanted to model out was a bigger issue with on-premises databases – when are we going to run out of storage?

Back in the day I’d cheat with msdb backups, comparing compressed sized to actuals, and moving on. However I don’t have a historical reference for Stack Overflow… so what can I do?

Taking a look at the tables we see a commonality in many tables – CreationDate! It looks like the rows faithfully are stamped when they are created.

We might not be able to truly look back in time but if trust the database we have a chance to query the contents and get back some interesting data about Stack.

So let’s head over to the notebook if you like, or sit back and read more.

Initial overview of the SQL Script
Initial overview of the SQL Script

The first text block contains the dynamic SQL to query the tables with CreationDate and construct additional queries to run – I could have automated this or added a cursor but given I am running it locally in SSMS and copying the CSV by hand today, it wasn’t worth it.

In our first code block ([25]) we install plotly==5.2.1 because of some newer and nicer features, setup the file upload of the csv([36]), and define the headers because SSMS is too lazy to include headers([37]).

Installing pre-reqs and uploading files
Installing pre-reqs and uploading files.

Next in code block ([38]) we define a DANGER_ZONE, the constant that should make us very worried because that’s how much space we have (in megabytes), and add a function to add DANGER_ZONES to figures easily.

Summarizing individual data tables
Summarizing individual data tables

We make a line figure, render it, and we see something!

DANGER ZONE added with tables.
DANGER ZONE added with tables.

We’re obviously not done yet, we’re going to need add some data to predict when this is going to fail.

In code block([39]) we sum and group the RunningTotalMBUsage values such that we get an overall value for each month’s total megabytes for all tables.

Add overall summaries of the values to the dataframe.
Add overall summaries of the values to the dataframe.

And happily in code block([135]) we see that we have our overall graph looking good, though obviously its a bit closer to our DANGER_ZONE!

Add overall summaries of the values to the dataframe.
Add overall summaries of the values to the dataframe.
Overall line added to the graph
Overall line added to the graph

Finally, in code block ([136]) we take all the lessons we’ve learned and project out a set of points to determine what a linear best fit line would look like.

Regression fit again actual measurements.
Regression fit again actual measurements.

Conclusions

Given a 500GB limit, we can see that a prediction of April 30th, 2026 is given for Stack’s storage being overwhelmed – and if you look closely you can see that Stack’s growth of data isn’t strictly linear, so it’s probably going to happen sooner than that. In reality I bet they have at least twice as much as that available 🙂

I had a good time using Colab, and the only frustrating moment during the process was when my sign in timed out, it silently switched me to a new login context and couldn’t save, porting that back over got annoying for a bit. Otherwise its great!

Feel free to copy/fork my notebook if you want to make the same predictions for you environment and try it out for yourself, I know I will. And if you were wondering, yes, I was listening to Kenny the entire time.

 

 

 

Useful Django Bits

I have been busy working on some other non-SQL related side projects recently, and I wanted to note some of the pieces of code I have been appreciating recently.

https://github.com/PaesslerAG/django-currentuser is a simple plugin that allows you to reference your current user context in your models various functions. This greatly simplified some user management functions within my codebase, as I could express it all in the model.

https://www.django-rest-framework.org/ is a powerful framework on top of Django that allows you to build a straightforward rest framework. Django doesn’t have object based permissions and I have been building out the next version of my codebase with it, its definitely a lot more pluggable than anything I designed.

https://pythoncircle.com/post/439/server-access-logging-in-django-using-middleware/ an easy way to track user access – one migration adds a log to your table, and you get whatever you want out of each request flow. Be careful that you follow your GDPR/CCPA guidelines!

https://github.com/pennersr/django-allauth is something I have been investigating but it seems a bit much for my goals, I will come back and update more about this soon.

Wrapping my head around object level permissions with Django

A quick aside, if you read my SQL stuff – I have been working on a side project with python, Django (a web framework), with a focus on bringing some sanity to recipes.

For the site I needed to implement a system where a user owns recipes, and other users cant modify them.

In Django there’s a built in auth model that supports both user registration and model level permissions(yay!), but it means that naively all users could edit all recipes(boo!)

I did some quick searching because I was hopeful that I could reuse some of the well thought out components already available.

The big options I found were django-guardian, Rules, django-rest, or figuring it out myself.

Checking in with Rules, it looks like its not so much a Django specific implementation as much as a way to implement rules + predicates to get things done, seems interesting but its not clear from the older examples provided how its supposed to work with a modern Django – I am going to skip it for now.

Checking in with django-guardian, I am seeing better examples that are relevant, but truthfully I need 1 owner as a permission.

On the face of it, I feel like a lot of these systems are really complex for my purposes – its cool to be the ultimate system but I just need a system that provides:

  1. Users which have permission to objects.
  2. Permissions means you have all roles on the object (edit/delete.)
  3. Each object that is ownable on creation must have a creator/author, who when its assigned can automatically edit/delete the thing.

Since I don’t need a complex permission system, I don’t care about meta-magic, and I just want to design a straight forward set of gates to manage, I decided to go off on my own and work on implementing an author that’s checked.

I added an author field to my model, the current user sets it in the model if its not already set, and then I use the is_current_authenticated_user() function from the django-currentuser package to tell if this matches the current request.

I also implemented a simple map behind the scenes that connects forms to models to pages so that the add/edit functions can just do their work and pass the correct set of forms back to the correct set of pages – I will have to keep looking because it wasnt clear how to infer the model’s form class automatically without some mightier python than I current posses.

I disable the form fields for niceties in the UI if you don’t own the object (so you dont accidentally try and edit it), but to be sure I also prevent saving on the backend if things don’t match up, and there’s an added bonus of free CSRF built into Django.

And so after all that I have straight forward solution without even touching the rules code – it took me more time to noodle over third party options than just set down and write it.