Textpattern CMS support forum

#1 2021-03-25 14:34:02

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,475
Website GitHub

Images and the file system layout from 4.9.0

We’re making inroads towards massively overhauling image handling in 4.9. That means moving things around. And while we’re at it, thinking about the overall structure of the file system.

There’s a GitHub issue about possibly revamping the file structure into more broad /textpattern, /public and /cms-shared top levels. Not only does this cut down on potential clashes between section names and the file system (the ol’ placeholder directory in the way message) but it means you can more easily clamp down permissions to shared/more private content such as config.php and plugins/tmp dirs. We’ll try and make the migration as painless as possible.

Specifically for images, there’s a comment on that GitHub issue that goes into details of how images will be affected when we enable multiple thumbnails for each image size.

The tl;dr is that we’re considering /your/site-root/public/images/ as an entry point, governed by a new pref co-located alongside the existing image dir pref. Would be nice if we could use the same pref, but I think that’s a stretch.

Under that dir will be some file system that will help balance the following goals:

  1. Find images relatively easily from their ID (we’re not planning on dropping the id.ext syntax, though we perhaps could).
  2. Store multiple images per size, uniquely identifiable by their resolution/size.
  3. Permit different types of file per image. So, maybe a png full size, but a webp 800px wide thumb and a 400px jpg.
  4. Traverse said structure in a sane manner from an FTP program without having to bounce endlessly up and down the dir structure to go between fullsize, 1280, 768 and 400px widths of the same image.
  5. Keep the number of files down per directory so the system can scale to millions of images.
  6. Support offloading of images to a cloud environment (if we can).

Please, if you have a little time to devote to reading the above thread/comment and either responding here or in the GitHub thread directly, it would be most appreciated. I’d like to see if there are any possible systems we can implement I might not have thought about that would balance things better.

Thank you.

Last edited by Bloke (2021-03-25 14:50:57)


The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.

Txp Builders – finely-crafted code, design and Txp


#2 2021-03-26 09:46:27

Vienuolis
Member
From: Vilnius, Lithuania
Registered: 2009-06-14
Posts: 311
Website GitHub GitLab Mastodon Twitter

Re: Images and the file system layout from 4.9.0

Very promising, thank you. Would it be possible with this method to keep only the original full-size assets outside (e.g. on a cloud storage server)?


#3 2021-03-26 11:39:10

Bloke

Re: Images and the file system layout from 4.9.0

Honestly I don’t know. Cloud storage is something Phil mentioned and seems like a good idea but there are of course hurdles such as how to manage a remote repository with regard to permissions and access credentials for read/write access. Having to maintain a separate “original” image on the cloud and then (somehow) using that to create thumbs that are locally stored in Txp doesn’t sound all that wonderful as a workflow. But as I don’t have any idea how to handle remote repos (beyond my decades-old hackery in smd_remote_file) I’m not sure what sort of setups are viable. If anyone has any experience in this arena, please speak up so it can be taken into account.

I have a local branch working with Intervention that offers sliders for adjusting brightness, contrast, gamma, and so forth. Buttons to rotate/flip should be simple to add. Crop will be harder, and I expect it will come later as it’ll require some JavaScript that’s above my pay grade.

I am testing all this on fairly large images and it is s-l-o-w operating on the originals, so not ready for prime time yet. Need to find some way to cache the altered images for reuse inside Intervention instead of having to create a new object in memory each time: creating a working copy will make the interface respond in a timely fashion.

I also need to factor in the new prefs and upload facility to the new directory structure before it can be unleashed for wider testing. But it’s all looking very promising.



#4 2021-03-26 16:24:36

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,096
Website GitHub Mastodon Twitter

Re: Images and the file system layout from 4.9.0

The issue of naming re images is an artificial one, as the directory could be named in the preferences. One way would be to ask for the name(s) in the setup process. Something like

Name the directory for your images (default images).


Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.


#5 2021-03-27 01:48:21

Bloke

Re: Images and the file system layout from 4.9.0

Not sure I understand your comment, sorry Yiannis. The directory is already named in preferences via the img_dir pref. We need a way to keep that pref so the lookup function knows where to check for images as fallback. This is primarily for upgraders: it’ll have no meaning for first-time installers and, in fact, should be omitted from the setup process.

We need a second pref (at least, people who upgrade will have two prefs: those who install will just see the new one) called img_base_path which will hold the full path to the images. viz:

img_dir = images
img_base_path = /var/www/path/to/root/public/images
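A minimal sketch of how a lookup might use the two prefs, trying the new location first and falling back to the legacy one. The pref names are the ones above; the function name `locate_image()`, its arguments and the plain `id.ext` layout beneath each directory are assumptions for illustration only, not how core does (or will do) it.

```php
<?php
// Hypothetical lookup helper: NOT a Textpattern core function.
// $prefs holds the two prefs discussed above; $siteRoot is the document root.
function locate_image($id, $ext, array $prefs, $siteRoot)
{
    $name = $id . '.' . $ext;
    $candidates = array();

    // New-style location (fresh installs and upgraded sites).
    if (!empty($prefs['img_base_path'])) {
        $candidates[] = rtrim($prefs['img_base_path'], '/') . '/' . $name;
    }

    // Legacy fallback (only meaningful on upgraded sites).
    if (!empty($prefs['img_dir'])) {
        $candidates[] = rtrim($siteRoot, '/') . '/' . trim($prefs['img_dir'], '/') . '/' . $name;
    }

    // Return the first candidate that actually exists on disk.
    foreach ($candidates as $path) {
        if (is_file($path)) {
            return $path;
        }
    }

    return null;
}
```

A fresh install would simply have no `img_dir` entry, so only the first candidate is ever checked.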

Have I misunderstood?



#6 2021-03-27 02:19:50

Bloke

Re: Images and the file system layout from 4.9.0

Been doing some tests. From an algorithm perspective, I think we can get away with this, which is relatively cheap (in terms of speed to calculate) and gives us a good spread of directory->file mappings:

$hash = substr(md5($id), -3);

Basically, if you throw an image id in where it says $id, you get a 3-character hash code spat out. That’s the subdir in which that image’s files will be held.
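As a rough sketch of how that hash could turn an id into a path: only the `substr(md5($id), -3)` step is taken from the line above. The function name `image_subdir_path()`, the default base path and the `id.ext` filename inside the subdir are assumptions for illustration.

```php
<?php
// Hypothetical helper: maps an image id to a hashed subdir path.
// The function name, base path and file naming are illustrative assumptions.
function image_subdir_path($id, $ext, $base = '/public/images', $width = 3)
{
    // md5() returns 32 lowercase hex characters; keep the last $width of them.
    $hash = substr(md5((string) $id), -$width);

    return rtrim($base, '/') . '/' . $hash . '/' . $id . '.' . ltrim($ext, '.');
}

// Prints something of the form /public/images/xxx/29.jpg,
// where xxx is the 3-character hex code for that id.
echo image_subdir_path(29, 'jpg'), "\n";
```

Because the hash depends only on the id, the same id always lands in the same subdir, which is what keeps the lookup cheap.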

For the maths geeks, using a 3-char code gives us a maximum subdir count of 4096. Pretty high compared to the 256 maximum subdirs we’d get with just a 2-char code. But, that’s offset by the fact that as the number of images grows, the distribution is way less dense.
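The subdir ceilings follow directly from each character of the code being one hex digit. A tiny sketch (the function name `max_subdirs()` is made up for this example):

```php
<?php
// Each hash character is a hex digit (16 possible values),
// so a code of N characters allows at most 16^N distinct subdirs.
function max_subdirs($chars)
{
    return pow(16, $chars);
}

foreach (array(2, 3) as $chars) {
    echo $chars, ' chars => ', max_subdirs($chars), " subdirs max\n";
}
// 2 chars => 256 subdirs max
// 3 chars => 4096 subdirs max
```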

Figuring that balancing the sweet spot based on people’s usage and growth will be key, I did some back-of-a-napkin PHP tests.

For 100 images, there’s only one hash clash (id=29 and id=93). So that means you’ll have 99 subdirs beneath /public/images: 98 of them will contain 1 image id and one will contain 2. Remember, that isn’t the same as 1 physical file because each image id comprises:

  • 1x original size image.
  • 1x full-res ‘main’ image (which we’re probably going to default to 1920×1440 – flipped for portrait).
  • 1x ‘grid’ image (which I think is going to be 240×240 – basically a 120px square image @2x – but I’m sure Phil will correct me if I’m wrong).
  • however many other sizes you wish to define.

Thus each image id will be represented (at minimum) by 3 physical files, with the maximum number of files being up to you.

As things scale, the number of subdirs goes up of course (towards the hard ceiling of 4096 subdirs when you reach 37553 images). For someone like towndock’s site with circa 26000 images, if he migrates all his images to the new file system, he’ll have:

  • 4088 three-character subdirs off /public/images.
  • One of those subdirs will contain 17 image ids – so that’s 51 files minimum.
  • At the other end of the scale, 650 of those subdirs will contain just 6 image ids (minimum: 18 physical files). A bunch of subdirs (about 50 of them) will still only contain 1 image id.

Not so bad.

Interestingly, his situation (sort of) improves if we go to a 2-char code:

  • Only 256 two-character subdirs off /public/images.
  • Between 72 and 126 image ids per subdir (i.e. anywhere between 216 and 378 files if you assume just three image sizes per id).

That’s manageable. But if we start to scale to a lot more images, things get significantly worse with a 2-char code and stay relatively stable for a 3-char code. For a quarter of a million image ids:

  • 2-char code: ~1000 image ids per subdir (i.e. approx. 3K image files per subdir).
  • 3-char code: between 36 and 92 image ids per subdir (i.e. between 108 and 276 image files per subdir).

At a million image ids:

  • 2-char code: ~3900 image ids per subdir (i.e. nearly 12K image files per subdir).
  • 3-char code: tops out at 296 image ids per subdir (i.e. just under 900 image files per subdir).
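For a quick sanity check of those figures, here is a sketch that computes the plain averages, assuming ids spread evenly over every possible subdir and the stock 3 files per id. The function name `avg_per_subdir()` is made up for this example; the real distribution has a spread around these averages, which is why the measured ranges and maxima above differ from them.

```php
<?php
// Average load per subdir, assuming a perfectly even spread of ids.
// $filesPerId defaults to the stock 3 files per image id.
function avg_per_subdir($imageIds, $codeWidth, $filesPerId = 3)
{
    $subdirs = pow(16, $codeWidth);
    $ids = $imageIds / $subdirs;

    return array('ids' => round($ids), 'files' => round($ids * $filesPerId));
}

foreach (array(250000, 1000000) as $n) {
    foreach (array(2, 3) as $w) {
        $avg = avg_per_subdir($n, $w);
        printf("%7d ids, %d-char code: ~%d ids, ~%d files per subdir\n", $n, $w, $avg['ids'], $avg['files']);
    }
}
//  250000 ids, 2-char code: ~977 ids, ~2930 files per subdir
//  250000 ids, 3-char code: ~61 ids, ~183 files per subdir
// 1000000 ids, 2-char code: ~3906 ids, ~11719 files per subdir
// 1000000 ids, 3-char code: ~244 ids, ~732 files per subdir
```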

The question we have to ask ourselves is this: where are we aiming? A million image ids seems like a ridiculous number to manage. The situation is manageable with a two-character code up to around 75K image ids. That delivers around 330 (give or take) image ids per subdir. So assuming you stick with the stock 3 images per id, you’re looking at roughly 1000 images per subdir. Each image size you add, of course, increases your file count by another 330 images per subdir (assuming you have the memory/resource/time to go and generate the damn things!)

I don’t know enough about inode/MFT file-per-dir or max overall file limits in *nix and Windows file systems. I suspect they’re colossal, but I do know of people running out of inodes with a lot of files per dir, even if their disks are relatively empty, so there is that to consider.

Imagine using an FTP program to trawl a 2-char or 3-char file system packed with images. Would it be better to limit the top-level subdirectories to 256 and have (potentially) a lot more files per subdir when you click to enter each one, or allow the top-level to go up to 4096 subdirs, which might slow down your initial directory read, but then hopping into each directory would be relatively fast with just a comparative handful of files?

I’m happy with a three-character code as I think that gives us a good trade-off but I’d be interested to hear if you think that’s overkill. I’d also be interested if anyone has any other methods of calculating a distributed filesystem hash that’s fast, scalable, and has an upper bound. Would we be better ditching ids and using filenames, despite the potential filename clashes? And if you rename/replace a file, what then? We’d have to recalculate its hash and move it to a new subdir on save. Messy.

Remember: once we set this in core, we can’t change it later without completely reindexing every image location. No way we can do that on upgrade :-\

And, yes, apparently I really do love doing this stuff on a Friday night. How sad am I?!

EDIT: if anybody wants my code for the tests, just stuff this in a (non-textiled) article and view it. Tinker with the two values at the top to see the effects when you refresh:

<pre><code>
<txp:php>
// Tinker with these two values.
$max_images = 10000;
$code_width = 3;

$out = array();
$freq = array();

// Create hash table.
for ($id = 1; $id <= $max_images; $id++) {
    // The last $code_width hex chars of the md5 form the subdir name.
    $hash = substr(md5((string) $id), -$code_width);
    $out[$hash][] = $id;
}

$subdirs = count($out);

// Create frequency table (numfiles => tally)
foreach ($out as $h => $files) {
    $numFiles = count($files);

    if (isset($freq[$numFiles])) {
        $freq[$numFiles]++;
    } else {
        $freq[$numFiles] = 1;
    }
}

// Sort by tally, lowest first, keeping the numfiles keys.
asort($freq, SORT_NUMERIC);

dmp('Images: ' . $max_images, 'Subdirs: ' . $subdirs, 'Images per subdir (min/avg/max): ', min(array_keys($freq)) . ' / ' . round($max_images/$subdirs, 2) . ' / ' . max(array_keys($freq)));

dmp('Frequency table [number of image ids => number of subdirs with this many images in them]:', $freq);
</txp:php>
</code></pre>

Last edited by Bloke (2021-03-27 02:49:42)



#7 2021-03-27 07:06:32

colak

Re: Images and the file system layout from 4.9.0

Bloke wrote #329493:

Not sure I understand your comment, sorry Yiannis.

So sorry for being too laconic, Stef.
I do visit the forum 2-3 times a day and read through the new posts, but my head-space is too full with a number of projects and other issues, which resulted in that unclear comment.

Here is a more detailed comment

Bloke wrote #329464:

There’s a GitHub issue about possibly revamping the file structure into more broad /textpattern, /public and /cms-shared top levels. Not only does this cut down on potential clashes between section names and the file system (the ol’ placeholder directory in the way message) but it means you can more easily clamp down permissions to shared/more private content such as config.php and plugins/tmp dirs. We’ll try and make the migration as painless as possible.

My response was for the point above. The issue you identified here is the naming of the images and files directories. Both of these could be renamed by users with an option in the installation process. I of course know that they are placeholders in the preferences, which means that allowing people to choose the names of these directories would reduce the possibility of clashes. Having said that, my only issue regarding this is the linkage of the images and files externally, i.e. in newsletters, partner sites, etc. Thankfully, it is not an issue for me, yet, re images but it is re files. An htaccess rule would be prudent to redirect requests from the old to the new directories.

Specifically for images, there’s a comment on that GitHub issue that goes into details of how images will be affected when we enable multiple thumbnails for each image size.

I am worried that for sites with a lot of images, this structure may become unsustainable. Also, who will be deciding the sizes? Will that be in the admin>preferences?

The tl;dr is that we’re considering /your/site-root/public/images/ as an entry point, governed by a new pref co-located alongside the existing image dir pref. Would be nice if we could use the same pref, but I think that’s a stretch.

Under that dir will be some file system that will help balance the following goals:

  1. Find images relatively easily from their ID (we’re not planning on dropping the id.ext syntax, though we perhaps could).
  2. Store multiple images per size, uniquely identifiable by their resolution/size.
  3. Permit different types of file per image. So, maybe a png full size, but a webp 800px wide thumb and a 400px jpg.
  4. Traverse said structure in a sane manner from an FTP program without having to bounce endlessly up and down the dir structure to go between fullsize, 1280, 768 and 400px widths of the same image.
  5. Keep the number of files down per directory so the system can scale to millions of images.
  6. Support offloading of images to a cloud environment (if we can).

I’d like to know how the number of files can be kept down in the proposed /public/images/ directory for those webmasters who do upload a lot of images.



#8 2021-03-27 07:39:53

colak

Re: Images and the file system layout from 4.9.0

Here are some relevant links of how others are dealing with this and related subjects

Things that have not been discussed here

  1. symlinks — could be a solution
  2. Dynamic resizing and optimisation of images — I have a problem with this as high traffic sites will stress the server
  3. Original images — A future-proof method would be to upload the original images which could be large in size, with the system creating their web equivalents. As the specs of internet technologies become higher, the original images will be there for any possible redesign. I have no idea regarding the possible extensions, this directory should support:)

Last edited by colak (2021-03-27 08:03:32)



#9 2021-03-27 10:28:05

Bloke

Re: Images and the file system layout from 4.9.0

colak wrote #329495:

linkage of the images and files externally. ie in newsletters, partner sites, etc.

If you hotlink by hand, nothing changes until you save an image, then it’d get moved.

The other way to handle it is to get rid of the second pref idea, and just do a mv (move) on upgrade of /images (or wherever your pref says your images are stored) => /public/images. That will break all hand-crafted image links immediately so I’m hesitant to do it but it would mean slightly less code faff.

Having the images stored in the root of /public/images from day one of an upgrade is attractive in the sense that we can still do the ‘move it to a new location when you save it’ step from the Images Edit panel (or multi-edit).

I am worried that for sites with a lot of images, this structure may become unsustainable.

That’s the point of trying to load balance them a little to spread the risk. Keeps site speeds up, improves FTP / filesystem file handling, but still allows people to (relatively easily) find them by hand if they need them. The image path (and number of thumbs) would be shown in the UI somewhere.

Also, who will be deciding the sizes? Will that be in the admin>preferences?

You (admin) decide, based on your site needs. When you upload one or more images, you’d type in the sizes you want created in a text box near the image upload button. It stashes the original (as you say, for later re-creation if you want) but never uses it after that. It’s just there. Intervention then creates you a bunch of images for your site to use.
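How that text box gets parsed isn’t decided. As a sketch only, assuming the sizes are entered as widths in pixels separated by commas and/or spaces (that input format and the function name `parse_sizes()` are both assumptions):

```php
<?php
// Hypothetical parser for a "sizes" text box such as "1280, 768 400".
// Returns unique positive integer widths, largest first; junk is ignored.
function parse_sizes($input)
{
    $sizes = array();

    foreach (preg_split('/[\s,]+/', trim($input), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        if (ctype_digit($token) && (int) $token > 0) {
            $sizes[(int) $token] = true; // Keyed by width to de-duplicate.
        }
    }

    $sizes = array_keys($sizes);
    rsort($sizes, SORT_NUMERIC);

    return $sizes;
}

print_r(parse_sizes('1280, 768 400, 768, abc')); // widths 1280, 768, 400
```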

The admin side will need some images too. Two, to be specific. One ‘main’ highish res image, which you can use in your designs if you wish – it’s just an image after all – but will be used as the basis for all subsequent image operations like crop, rotate, greyscale, brightness/contrast etc.

The other image is a grid image (square, probably shrunk and auto-cropped). Again, if that fits your front-end website needs by all means use it. These initial sizes will be governed by an advanced pref. If you don’t like those sizes or would prefer different defaults – perhaps you have a monster server and can handle processing immense images, or you don’t mind the wait as it crunches things – expose the pref and change them.

The key thing about all these images is that they will be created dynamically. Well, there are edge cases and I haven’t figured out how to do that yet. I’ll get to that later.

On upload, Intervention will create the ‘main’ image for use in the admin side and as a basis for future resize/crop/etc operations. This is purely for sanity as going back to a 15 megapixel or higher camera image each time is memory/time intensive. So we’ll take the hit on the slight loss of quality. Not sure what will happen for existing images. I guess we’ll have to ‘assume’ that the one there already is the highest res copy. Not ideal but if you want to alter that you’ll need to edit each image and re-upload a higher res copy to recreate the rest.

If you ever visit the Edit panel and there is no default image size file in the filesystem, we’ll instruct Intervention to go and create one from the original on-the-fly. That’ll delay loading of the panel but there’s nothing we can do about it. If the high-res working copy exists, it’ll be used straight away: no additional delay.

The same goes for the image list/grid panel. If you view a list and there’s no grid size created for an image, we’ll probably:

a) Look for the working copy. If it exists, resize + crop it to a grid size image.
b) If no working copy exists, go back to the original, create a working copy, then (from that) resize+crop to make the grid size image.

That covers our bases. No missing images will ever be shown because if you delete them, they get recreated.
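The a)/b) fallback above could be sketched roughly like this. The function name `ensure_grid_image()` and the `$derive` callback are assumptions: `$derive` stands in for whatever does the real resize/crop (Intervention, in this plan), and I’m deliberately not guessing at its API here.

```php
<?php
// Sketch of the fallback chain: grid image <- working copy <- original.
// $derive is a stand-in callback: function ($from, $to, $kind) that writes
// the $kind ('main' or 'grid') version of $from to $to.
function ensure_grid_image($original, $working, $grid, $derive)
{
    if (is_file($grid)) {
        return $grid; // Already there: nothing to do.
    }

    if (!is_file($working)) {
        if (!is_file($original)) {
            return null; // Nothing left to rebuild from.
        }

        // b) No working copy: recreate it from the original first.
        call_user_func($derive, $original, $working, 'main');
    }

    // a) Resize + crop the working copy down to the grid size.
    call_user_func($derive, $working, $grid, 'grid');

    return is_file($grid) ? $grid : null;
}
```

Passing the image work in as a callback keeps the file-existence logic separate from the library that does the pixel pushing.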

These two ‘system’ sizes will be available in the image edit panel. Any additional size images you require for your site will be up to you to define/create. When you use a textbox to indicate which sizes you want (either in the upload panel or edit panel) the values are remembered for next time so when you do any subsequent operation, you get all the image sizes you need to support your design.

If you go into the image edit panel and there are missing images from the set in the text box, they will be auto-created (from the working copy) at that point. Shouldn’t be too much delay here as the working copy is smaller. Plus the resize operations can be batched off the same in-memory copy. The delay comes from loading that image into memory in the first place: the resizes are comparatively rapid.

You’ll be able to operate on individual images here in some fashion (maybe a dropdown selector of sizes) so you can individually replace an image for art direction purposes or whatever you need. Maybe you want to recreate a thumb from the working copy and do a different crop on it. That’s cool.

You’ll be able to create new sizes when you save via a textbox/checkbox/something near the save/apply button. So when you exit the panel, you have the option of cascading any changes you made to the current image to all thumbs in whatever sizes you want. And this set of sizes will be remembered for the next image so you can more rapidly change them in sequence if you don’t have too many to process.

Some of that might not quite work yet – I haven’t figured out the details, so if there are things that don’t work out, none of the above is gospel yet. There may need to be some compromises made.

I’d like to know how the number of files can be kept down in the proposed /public/images/ directory for those webmasters who do upload a lot of images.

Up to you. When you save from the edit panel, we can offer a ‘clear all other sizes of this image’ checkbox? That would blat any existing images and recreate the smaller versions from the working copy. If you didn’t do that, then if you start tinkering with image sizes, yes, the number of files will grow.

In the List panel I would like to have a column that indicates how many size variants there are of each image. Not sure how feasible that is, but it’d be handy so you can sort by it to see if there are any assets that are missing.

It did cross my mind that it might be nice to allow things like the <txp:images> tag to auto-create images if they’re missing. That sidesteps a whole raft of issues by redirecting image requests to a hunk of PHP to return images of the needed sizes/types based on your srcset.

The downside? A bot or unscrupulous person could tinker with the URL and start mass-creating images at a bajillion different resolutions, flooding your filesystem.

There is a potential avenue in this vein we could explore though: what if we allowed images to be auto-created from the front-end but only if you’re logged in and have sufficient rights? That would be a fabulous way of making images.
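A minimal sketch of such a gate. The logged-in and privilege checks come from the idea above. The extra whitelist of permitted widths is an addition of this sketch, not something proposed in the thread, and the function name and arguments are hypothetical.

```php
<?php
// Hypothetical guard: decide whether a front-end request may trigger
// creation of a missing image. The two booleans would come from the
// real login/privilege system.
function may_create_image($requestedWidth, array $allowedWidths, $isLoggedIn, $hasImageRights)
{
    // Anonymous visitors and bots never trigger image creation.
    if (!$isLoggedIn || !$hasImageRights) {
        return false;
    }

    // Assumption: only widths the admin has defined may be generated,
    // so URL tinkering cannot mass-create arbitrary resolutions.
    return in_array((int) $requestedWidth, $allowedWidths, true);
}

var_dump(may_create_image('768', array(400, 768, 1280), true, true));  // bool(true)
var_dump(may_create_image('769', array(400, 768, 1280), true, true));  // bool(false)
var_dump(may_create_image('768', array(400, 768, 1280), false, false)); // bool(false)
```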

Say you have a gallery. You need 3 sizes of image. You craft your responsive image tags in your template to require 3 or 4 sizes/type of pic for various devices, and visit the gallery page as a logged-in user on your front-end website. A bunch of assets would be missing. Intervention would go and create them from the working copy. As you resize your browser or visit on other devices, more would be created. Those first few page loads would take a little while, but after that, everything is effectively ‘cached’. You visit an individual article: same deal: missing assets are created on-the-fly for you.

This helps if you ever need to clean up the site or you decide you want a different set of sizes for whatever reason. You can just do a find/delete on any images in the file system that aren’t suffixed ‘_original’, browse your site and the new pics will be auto-created at the dimensions you need based on your tags.
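The find step could look something like the sketch below, which only lists candidates and deletes nothing. It assumes originals are named with `_original` immediately before the extension (e.g. `29_original.png`); that exact naming, the function name and the example base path are assumptions.

```php
<?php
// Lists every file under $baseDir whose name (minus extension) does NOT
// end in "_original". Dry run only: nothing is deleted here.
function find_derived_images($baseDir)
{
    $derived = array();
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($baseDir, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $file) {
        if (!$file->isFile()) {
            continue;
        }

        // Filename without extension, e.g. "29_original" or "29_400".
        $stem = pathinfo($file->getFilename(), PATHINFO_FILENAME);

        if (substr($stem, -9) !== '_original') {
            $derived[] = $file->getPathname();
        }
    }

    sort($derived);

    return $derived;
}

$base = '/path/to/public/images'; // Hypothetical location.

if (is_dir($base)) {
    foreach (find_derived_images($base) as $path) {
        echo $path, "\n"; // Review the list before unlink()ing anything.
    }
}
```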

That’s kind of attractive. Maybe it could be opt-in, dunno.

Lots to think about and I’m still chewing things over. Central to it is storage in a sane manner that allows you to keep the number of files under control but also to permit flexibility for responsive site designs and the needs of the back end to help you administer them.



#10 2021-03-27 10:30:47

Bloke

Re: Images and the file system layout from 4.9.0

colak wrote #329496:

Here are some relevant links

Thank you! Love that one about how they approached arbitrary cropping. I’m probably gonna use that idea.

recommends to have the images organised in semantically named sub-folders. This is my favourite method (categories in txp language) but I do also understand how this could create 404s should one decide to change the category of an image.

Yes, sadly it’s a real pain to manage as it involves heavy file operations when changing one (or more: multi-edit) images.

symlinks — could be a solution

Not for Windows, sadly, so we have to avoid them.

Dynamic resizing and optimisation of images — I have a problem with this as high traffic sites will stress the server

See above.

Original images — A future-proof method would be to upload the original images which could be large in size, with the system creating their web equivalents.

Exactly. This is how we intend to do it.



#11 2021-03-27 10:44:24

jakob
Admin
From: Germany
Registered: 2005-01-20
Posts: 4,752
Website

Re: Images and the file system layout from 4.9.0

I’m finding it hard to formulate an answer to this long and complex topic so apologies for no input as yet.

Two (probably a bit awkward) things to consider, though:

  • I very much like the fact that there’s a predictable image url that one can reconstruct easily. It’s so much easier to find the image in the file system, as well as the smd_thumbnail versions of them. WP can be infuriating when the image is buried away in some month in some year (sometimes I get the feeling the same image in different sizes can be strewn across several locations). It also makes it possible to manually port over large quantities of content to textpattern from some other (legacy) site without having to go through uploading to textpattern. Once hashes get introduced into the mix, it gets much harder (you helped me with my 700+ hashed images last week :-).
  • When updating other people’s sites, I still see a lot of instances where people have !/images/234.jpg! in their page content. Even when people are using a shortcode, manually constructed image_urls break when a computed hash gets introduced into the mix.

TXP Builders – finely-crafted code, design and txp


#12 2021-03-27 11:00:28

Bloke

Re: Images and the file system layout from 4.9.0

jakob wrote #329502:

I very much like the fact that there’s a predictable image url that one can reconstruct easily.

Me too. I was originally working on a date-based format and, while it works, it’s a real pain to just find anything you want, as you say, since it relies on you knowing when the image was uploaded. And if it was subsequently replaced/modified, the datestamp of the original might change and then you have the dilemma of moving it to a new date subdir, or leaving it and having the two not match up.

Once hashes get introduced into the mix, it gets much harder

That is true, and a downside. The saving grace is that, since the hash is incredibly simple to administer, a parallel process of taking your existing images, batch renaming them to sequential numbers and passing each through that hash process will spit out the directory structure you need. Thus you could pre-prepare content that way, upload them en masse and simply point Txp at this set.
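A sketch of that parallel process, producing a plan of old name => new path without touching any files. It assumes the 3-character md5 hash and an `id.ext` filename inside each subdir; the function name, the starting id and the base path are all illustrative.

```php
<?php
// Hypothetical import planner: assigns sequential ids to existing files
// and works out the hashed subdir each would live in. No files are moved.
function plan_import(array $files, $startId = 1, $base = 'public/images')
{
    $plan = array();
    $id = $startId;

    foreach ($files as $file) {
        $ext = strtolower(pathinfo($file, PATHINFO_EXTENSION));
        $hash = substr(md5((string) $id), -3);
        $plan[$file] = $base . '/' . $hash . '/' . $id . '.' . $ext;
        $id++;
    }

    return $plan;
}

foreach (plan_import(array('beach.JPG', 'logo.png'), 701) as $from => $to) {
    echo $from, ' => ', $to, "\n";
}
```

The starting id would need to match whatever ids the database rows end up with, otherwise the files land in the wrong subdirs.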

It’d be ace if Txp would see them and go ‘woah, there are images here that aren’t in the DB… shall I import them?’ Like we have that option in Files to create from assets on the server but not in the DB. Something like that, but maybe in bulk. Maybe something to think about in future (an image import plugin, perhaps?)

When updating other people’s sites, I still see a lot of instances where people have !/images/234.jpg! in their page content.

Yeah, that is annoying. We could do with some way of redirecting those to the new system. If the resulting URL could be parsed, its ID/extension extracted, we could then throw it at the imagesrcurl() to get the new location.
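The parsing half of that might look like the sketch below. It assumes legacy URLs end in `id.ext`, or `idt.ext` for the old-style thumbnail; the function name and the returned array shape are made up for this example, and what gets done with the result (e.g. handing it to imagesrcurl()) is left open.

```php
<?php
// Hypothetical parser for hand-written legacy image URLs such as
// /images/234.jpg or /images/234t.jpg (old-style thumbnail).
function parse_legacy_image_url($url)
{
    $path = parse_url($url, PHP_URL_PATH);

    if (!is_string($path)) {
        return null;
    }

    // Digits, an optional "t" thumbnail marker, then the extension.
    if (!preg_match('#/(\d+)(t?)\.([a-z0-9]+)$#i', $path, $m)) {
        return null;
    }

    return array(
        'id'        => (int) $m[1],
        'ext'       => strtolower($m[3]),
        'thumbnail' => $m[2] !== '',
    );
}

print_r(parse_legacy_image_url('/images/234.jpg'));
```

Anything that doesn’t match (a named file, say) returns null and could be left for the web server to handle as it does today.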

Open to ideas here.

