Press "Enter" to skip to content

Category: Python

Reading about Python’s Poetry

Poetry bills itself as “Python packaging and dependency management made easy” – I will dive in a bit more…

Installing and configuring

  • Poetry requires Python 3.8+ (that’s a lot of missing Python)
  • Has the classic “fun” insecure installer approach by default curl -sSL https://install.python-poetry.org | python3 – not my favorite, but it looks like Poetry has its own bootstrapping problem right now.
  • Uses the pyproject.toml format for configuration, with the tool.poetry, tool.poetry.dependencies, and build-system are the most important starting fields.
    • Poetry has a pretty well-thought-out use of the pyproject format.

Commands and Usage

  • Initializing an existing project is as easy as poetry init it will ask you questions about your package and get your boostrapped.
  • You can add packages with poetry add and poetry remove, first blush it feels like I am using cargo.
    • Upon upgrade/change Poetry removes old packages.There are a few GitHub issues about it, so if you are Windows and you want a faultless experience you might want to skip Poetry for now.
    • This will also resolve the package versions to ensure compatibility – a clear positive knowing your packages work out of the box together, but I have burned by Conda before, so a package solver always gets some side-eye.
  • poetry install grabs the packages and installs them in your environment.
  • poetry update Gets the latest versions of the dependencies and write them to your lock file.
  • poetry run runs the command within the current virtualenv.
    • This combines with the tool.poetry.scripts section of the pyproject file – you can define a file to run and then poetry run special-command to run your special-command.
  • poetry shell spawns a shell in the virtual environment (really useful for testing random stuff).
  • There’s a few more commands around lock files and more esoteric needs for the build system, so I will stop there for now.

Other interesting differences from a simplified pip env

  • Poetry is much more active in managing environments than a simple pip+venv setup, and actively takes steps to activate/validate the version of Python and your current environment when running code.
  • There’s a bit more on the type of build tools you can emit, versioning, and dependency groups which you generally wouldn’t have in the simpler tooling modes.

Final Thoughts

Overall my first blush with Poetry is that it’s a very cool tool (if you are not on Windows) and that it definitely seems that once it’s set up. It seems like you’d have more luck getting new packages added to existing projects without the “fun” of Python packaging issues arising suddenly in the wild (or hopefully your full featured test suite)/

Because of the file deletion issue (and me on Windows most of the time) I am still going to be sticking to pip+venv. Any add/remove command has about a 50% chance of going south for the boxes I am using.

Fixing – Error: “PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

During my attempt to connect PyCharm to a SQL Server data source with Windows Authentication and ran into a slew of errors. Upon searching I noticed there was a dearth of resources on this issue.

Some posts mention messing with your certificate store but with a local sever I dont have the interest in configuring certificate based auth.

From experience I knew a security issue reported in SQL Auth generally is related to the certificates/encryption settings on the connection, often turning it off works fine (don’t do this in production.)

I found nothing useful in the SSL/TLS – clearly the standard “uncheck encryption” trick isn’t managed here.

 

 

Looking further I found the Driver: section, here the MSSQL driver settings I needed are available.

picking advanced/driver settings

 

 

 

Click the Advanced tab.clicking the advanced tab in the driver settings

 

 

 

 

Update integratedSecurity to true and encrypt to false. changing pycharm driver settings

 

 

This fixed my authentication issues with the local server.connected window

Predicting When Your SQL Server Will Run Out Of Space via Colab, Plotly, and Pandas

No, not Azure Data Studio or Databricks notebooks (yet) – I wanted to give Google Colab a spin and the big G hosted Jupyter notebook has an expansive free tier with more RAM than my work computer and a graphics card attached to boot!

Going forward with my explorations I wanted to make use of an open-ish dataset, so I used Stack Overflow via Brent Ozar’s blog post on getting the data

Keep in mind its almost 500GB (zip + extract) – and while in an ideal situation we wouldn’t be torrenting a database, extracting, attaching, querying, etc to get at our insights I’ve honestly seen slower iterations at many enterprises so it wasn’t a big deal for me.

If you’ve never used Stack’s database its worth mentioning you an also avoid this all by running queries on their own servers at the Stack Exchange Data Explorer. There’s more tables there too!

If you just want the notebook code click here!

An SSMS object explorer view of the tables in Stackoverflow with only 12 tables
Turns out you can run a global website on 12 tables?

If you haven’t touched Stack’s database before you might be surprised how svelte it is, the backup I am working with only has 12 tables. That’s right, just 12.

The truth is that you don’t need to make things complex to make a useful product, and of course Stack/Brent have removed some ancillary and internal things which is fine by me.

The first question I wanted to model out was a bigger issue with on-premises databases – when are we going to run out of storage?

Back in the day I’d cheat with msdb backups, comparing compressed sized to actuals, and moving on. However I don’t have a historical reference for Stack Overflow… so what can I do?

Taking a look at the tables we see a commonality in many tables – CreationDate! It looks like the rows faithfully are stamped when they are created.

We might not be able to truly look back in time but if trust the database we have a chance to query the contents and get back some interesting data about Stack.

So let’s head over to the notebook if you like, or sit back and read more.

Initial overview of the SQL Script
Initial overview of the SQL Script

The first text block contains the dynamic SQL to query the tables with CreationDate and construct additional queries to run – I could have automated this or added a cursor but given I am running it locally in SSMS and copying the CSV by hand today, it wasn’t worth it.

In our first code block ([25]) we install plotly==5.2.1 because of some newer and nicer features, setup the file upload of the csv([36]), and define the headers because SSMS is too lazy to include headers([37]).

Installing pre-reqs and uploading files
Installing pre-reqs and uploading files.

Next in code block ([38]) we define a DANGER_ZONE, the constant that should make us very worried because that’s how much space we have (in megabytes), and add a function to add DANGER_ZONES to figures easily.

Summarizing individual data tables
Summarizing individual data tables

We make a line figure, render it, and we see something!

DANGER ZONE added with tables.
DANGER ZONE added with tables.

We’re obviously not done yet, we’re going to need add some data to predict when this is going to fail.

In code block([39]) we sum and group the RunningTotalMBUsage values such that we get an overall value for each month’s total megabytes for all tables.

Add overall summaries of the values to the dataframe.
Add overall summaries of the values to the dataframe.

And happily in code block([135]) we see that we have our overall graph looking good, though obviously its a bit closer to our DANGER_ZONE!

Add overall summaries of the values to the dataframe.
Add overall summaries of the values to the dataframe.
Overall line added to the graph
Overall line added to the graph

Finally, in code block ([136]) we take all the lessons we’ve learned and project out a set of points to determine what a linear best fit line would look like.

Regression fit again actual measurements.
Regression fit again actual measurements.

Conclusions

Given a 500GB limit, we can see that a prediction of April 30th, 2026 is given for Stack’s storage being overwhelmed – and if you look closely you can see that Stack’s growth of data isn’t strictly linear, so it’s probably going to happen sooner than that. In reality I bet they have at least twice as much as that available 🙂

I had a good time using Colab, and the only frustrating moment during the process was when my sign in timed out, it silently switched me to a new login context and couldn’t save, porting that back over got annoying for a bit. Otherwise its great!

Feel free to copy/fork my notebook if you want to make the same predictions for you environment and try it out for yourself, I know I will. And if you were wondering, yes, I was listening to Kenny the entire time.

 

 

 

Useful Django Bits

I have been busy working on some other non-SQL related side projects recently, and I wanted to note some of the pieces of code I have been appreciating recently.

https://github.com/PaesslerAG/django-currentuser is a simple plugin that allows you to reference your current user context in your models various functions. This greatly simplified some user management functions within my codebase, as I could express it all in the model.

https://www.django-rest-framework.org/ is a powerful framework on top of Django that allows you to build a straightforward rest framework. Django doesn’t have object based permissions and I have been building out the next version of my codebase with it, its definitely a lot more pluggable than anything I designed.

https://pythoncircle.com/post/439/server-access-logging-in-django-using-middleware/ an easy way to track user access – one migration adds a log to your table, and you get whatever you want out of each request flow. Be careful that you follow your GDPR/CCPA guidelines!

https://github.com/pennersr/django-allauth is something I have been investigating but it seems a bit much for my goals, I will come back and update more about this soon.

Wrapping my head around object level permissions with Django

A quick aside, if you read my SQL stuff – I have been working on a side project with python, Django (a web framework), with a focus on bringing some sanity to recipes.

For the site I needed to implement a system where a user owns recipes, and other users cant modify them.

In Django there’s a built in auth model that supports both user registration and model level permissions(yay!), but it means that naively all users could edit all recipes(boo!)

I did some quick searching because I was hopeful that I could reuse some of the well thought out components already available.

The big options I found were django-guardian, Rules, django-rest, or figuring it out myself.

Checking in with Rules, it looks like its not so much a Django specific implementation as much as a way to implement rules + predicates to get things done, seems interesting but its not clear from the older examples provided how its supposed to work with a modern Django – I am going to skip it for now.

Checking in with django-guardian, I am seeing better examples that are relevant, but truthfully I need 1 owner as a permission.

On the face of it, I feel like a lot of these systems are really complex for my purposes – its cool to be the ultimate system but I just need a system that provides:

  1. Users which have permission to objects.
  2. Permissions means you have all roles on the object (edit/delete.)
  3. Each object that is ownable on creation must have a creator/author, who when its assigned can automatically edit/delete the thing.

Since I don’t need a complex permission system, I don’t care about meta-magic, and I just want to design a straight forward set of gates to manage, I decided to go off on my own and work on implementing an author that’s checked.

I added an author field to my model, the current user sets it in the model if its not already set, and then I use the is_current_authenticated_user() function from the django-currentuser package to tell if this matches the current request.

I also implemented a simple map behind the scenes that connects forms to models to pages so that the add/edit functions can just do their work and pass the correct set of forms back to the correct set of pages – I will have to keep looking because it wasnt clear how to infer the model’s form class automatically without some mightier python than I current posses.

I disable the form fields for niceties in the UI if you don’t own the object (so you dont accidentally try and edit it), but to be sure I also prevent saving on the backend if things don’t match up, and there’s an added bonus of free CSRF built into Django.

And so after all that I have straight forward solution without even touching the rules code – it took me more time to noodle over third party options than just set down and write it.