Images and the file system layout from 4.9.0

Bloke · 2021-03-28 12:07:45

phiw13 wrote #329537:

one thing I don’t like is a date-based system… that gives “labyrinth” a bad name

That’s my experience with it too. Although I started out wondering if we could do it better, I’ve not yet found a way to use date-based dir structures in a sane way that helps someone who’s tearing their hair out shouting where the hell is this particular picture stored?

I am thinking “name” here not so much for administrative management (DB side) as the user friendly part

Using name would be nice if we could solve the duplication issue. i.e. what happens if you replace an image file with one where you’ve changed the filename. We already have that (awful, imo) disconnect on the Files panel where you can upload sales_presentation_v1.pdf and serve it; then, later, click to edit, choose File Replace and swap it for sales_presentation_v2.pdf. The file is uploaded and renamed to sales_presentation_v1.pdf so you can still access it from the same name. That’s:

Good that you don’t get a broken download link.
Bad that the version indicated in the file name doesn’t match the actual file content.
Bad that your browser might not realise it’s a new version and continue to serve the old one – and there’s no neat way of ‘force refreshing’ a file_download. The usual SHIFT+REFRESH doesn’t always work because it’s a redirect.

Adopting that kind of thing with images fills me with trepidation. We don’t have the refresh issue so much (because images are usually served directly and are more easily refreshed) but having bad filenames that don’t quite represent their content because you’ve decided to improve its name is worse than using a faceless ID+metadata, from an SEO standpoint: potentially misleading.

Filesize tend to be much larger… Quality, as in for certain types of images you have lots of unsharpness along edges and border, odd colour shifts

Yeah if you’re working with fine detail images with sharp edges, auto-creation can be quite horrible. That’s why I’d like to retain the ability to replace individual thumbs easily from the UI. And permit you to swap them with versions of different type.

Perhaps the ‘filter’ controls are a step too far in this regard. Maybe you only see those if you choose the main ‘working copy’? If one is eschewing ImageMagick/GD completely anyway and creating thumbs by hand, you won’t be using the automatic conversion anyway during upload. Adopting a workflow like you suggest with smd_thumbnail makes perfect sense here. Create the thumb by default (for the List/Grid view), then switch on the profile(s) you want to edit, click into the image, upload your thumbs, then turn the profiles off after until you need them again.

To support this workflow here in core:

Leave everything at its defaults. You’ll get an original, a working copy and a grid thumb created by the system.
Click to edit.
Use some kind of ‘add thumbs’ feature to indicate you want some additional thumbnails for this size image. That adds them to the dropdown so you can select them, then…
… click browse to load an image into that slot. Repeat for each thumb.

Bonus points could be had here if we can find ways to intelligently solve these use cases:

Ability to add more than one thumb at once, e.g. from the edit panel, permit selection of multiple images. Perhaps you could select the entry at the bottom of the dropdown that says “New sizes” and then Browse. Select 1, 2, 3… whatever number of images you want as thumbs and upload them. Txp creates a thumb from each of them and uses the dimensions of each image it reads as the sizes available in the dropdown, adding them automatically. I like that.
From the list panel, the ability to upload more than one version of the same image. At the moment, each image of a multi-set is assumed to be a discrete image. If you’ve gone to the trouble of prepping a set of them externally, it’d be nice to have a radio/checkbox alongside the Upload button that says “Treat these pics as one image set”. So if you pick 4 images when you upload, instead of creating 4 discrete new IDs, it creates one ID and uses each of the selected images as different size thumbnails (plus auto-creating working image and grid for Txp’s use if the dimensions don’t match any of the incoming files). Again, as above, the sizes can be read from the incoming files to create the ‘profiles’ for that image on the fly.
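
The "read the sizes from the incoming files" idea can be sketched roughly like this. It is a minimal illustration in Python rather than Txp's PHP, it only understands PNG headers, and every function name here is hypothetical:

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of a PNG byte string."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # The IHDR data starts with two big-endian 32-bit integers: width, height.
    return struct.unpack(">II", data[16:24])


def profiles_from_uploads(uploads: dict[str, bytes]) -> dict[str, str]:
    """Map each uploaded filename to a 'WxH' profile label read from the file."""
    profiles = {}
    for name, data in uploads.items():
        width, height = png_dimensions(data)
        profiles[name] = f"{width}x{height}"
    return profiles


def fake_png_header(width: int, height: int) -> bytes:
    """Build just enough of a PNG (signature + start of IHDR) for a demo."""
    return (PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR"
            + struct.pack(">II", width, height))
```

Feeding it a set of files prepared externally would give one profile label per file, which is the information the dropdown would need. A real implementation would lean on whatever image library is chosen rather than parsing headers by hand.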

In the latter case, we could actually make it so that if you indicate all the images you’re uploading are a set, it switches you straight into the Edit panel after upload. So you can add the metadata alt/caption immediately. But if you state all the images are independent, it leaves you on the List panel as it does now.

File storage aside, I think there are some nice wins to be had here to allow people who are serious about not using ImageMagick/GD to bypass them and use their own for a superfast workflow.

Whether this is enabled in core or we expose the image system in a better way via callbacks so a plugin can augment it is something we can decide as we iterate things.

Last edited by Bloke (2021-03-28 12:25:42)

philwareham · 2021-03-28 12:36:01

Might be some useful ideas here, if you’ve not already read it.

I think each image ID is going to need its own directory to house its images and the temp directory for its work files. How we structure the directories above that is the conundrum. Looks like a 32,000 subdirectory limit per directory level needs to be heeded, so we definitely need a way of breaking the directories down into small chunks.

Bloke · 2021-03-28 15:01:25

Good read, thanks Phil. I like that one about using the last 4 digits/chars and splitting on that. Might be simpler than a completely random hash, although the first 999 files may need special dispensation. Haven’t checked his code yet and seen how it scales. I’ll investigate this week.
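
One possible reading of the "last 4 digits" idea, sketched in Python. The 2+2 split and the zero-padding used for IDs below 1000 (the "special dispensation") are assumptions for illustration, not taken from the linked article:

```python
def subdir_from_last_four(image_id: int) -> str:
    """Derive a two-level subdirectory from the last four digits of an ID.

    IDs below 1000 have fewer than four digits, so they are zero-padded.
    """
    if image_id < 1:
        raise ValueError("image IDs start at 1")
    last_four = f"{image_id:04d}"[-4:]
    return f"{last_four[:2]}/{last_four[2:]}"
```

With this split, ID 7 lands in 00/07 and ID 123456 in 34/56. Note that two levels of two digits allow up to 100 × 100 directories, which may or may not be flat enough.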

philwareham wrote #329546:

each image ID is going to need its own directory to house its images and the temp directory for its work files.

One subdir per id: yes, we could do that. Or a few images in one. Prefixing each image with its id means files can co-exist happily in a dir. And they’re grouped naturally when sorting so you can see all the files related to a single image ID in FTP/file browsers, which is a bonus.
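
A small sketch of how ID-prefixed files group together. The `<id>_<label>.<ext>` naming pattern is hypothetical, since no naming convention has been settled:

```python
from collections import defaultdict


def group_by_image_id(filenames: list[str]) -> dict[int, list[str]]:
    """Group filenames of the (hypothetical) form '<id>_<label>.<ext>' by ID."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        prefix, _, _rest = name.partition("_")
        if prefix.isdigit():  # ignore files that carry no numeric ID prefix
            groups[int(prefix)].append(name)
    return dict(groups)
```

One wrinkle worth bearing in mind: a plain alphabetical sort in an FTP client puts 100_grid.jpg before 42_grid.jpg, so zero-padding the ID prefix would be needed if numeric ordering matters.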

Keeping individual temp dirs is attractive from a compartmentalisation viewpoint but does make it harder (potentially) to clean up by hand if you start getting files floating around due to crashed browser sessions or using the browser’s Back button rather than ‘Cancel’ (which can clean up after itself).

I don’t think a single well-known temp dir is too big a deal. You’ll only usually work on one image at a time anyway; even in a multi-user environment there’ll only be a few of you potentially working on images at any one time so the dir won’t get crowded. Just like with regular images, prefixing temp files with the ID being worked on will make it simple to locate ones that belong to the same editing session and blat them if necessary.

One thing we do need to be mindful of is that, under *nix, adding a temp dir to each subdir uses one extra inode for something that is only occasionally used. And inodes are a precious commodity. It’s entirely possible (even today, I believe: but I’m willing to be proved wrong) to run out of inodes despite there being a tonne of storage space left on your drive.

The same can be said for creating a subdir of its own id. We could do that, but for every id, there’s one inode used for each physical file size, plus an extra inode for its /id subdir. Back of a napkin example:

8000 image IDs.
5 responsive sizes, including system thumb and system working copy = 40000 inodes.
Number of subdirectories to house these files if using the hash method I mentioned above, with a three-char code: 3528 (=43528 inodes).
One extra subdir per id to house its own files: 8000 (= 51528 inodes).
One extra subdir per id for dedicated temp files: 8000 (= 59528 inodes).

That’s an extra 8K or 16K inodes used up for no (or very little) additional storage value. So, imo, the flatter the structure the better.
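
The napkin maths above, as a few lines of Python so the figures can be re-run with other numbers. The function and parameter names are made up for the illustration:

```python
def inode_estimate(image_ids: int, sizes_per_id: int, hash_subdirs: int,
                   per_id_dir: bool = False,
                   per_id_temp_dir: bool = False) -> int:
    """Count inodes: one per physical file, one per directory."""
    total = image_ids * sizes_per_id + hash_subdirs
    if per_id_dir:
        total += image_ids  # one extra directory per image ID
    if per_id_temp_dir:
        total += image_ids  # one extra temp directory per image ID
    return total


flat = inode_estimate(8000, 5, 3528)
with_id_dirs = inode_estimate(8000, 5, 3528, per_id_dir=True)
with_temp_dirs = inode_estimate(8000, 5, 3528, per_id_dir=True,
                                per_id_temp_dir=True)
print(flat, with_id_dirs, with_temp_dirs)
```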

How we structure the directories above that is the conundrum.

Absolutely. That’s what I’m trying to solve in as sane and simple a manner as possible: limit the maximum dir space (so there’s no capability for there to be tens of thousands of dirs directly under /images) and keep a reasonable number of files within each of those. And keep that structure as flat as possible.

Last edited by Bloke (2021-03-28 15:04:25)

zero · 2021-03-28 16:43:30

Bloke wrote #329544:

Because there are people with thousands and thousands of images.

What if you designed the filesystem for people who use less than your (the developer’s) ideal maximum number of images? Then, for those with more images, create a plugin that can create filesystem 2 in parallel with the basic filesystem. And want more? Do it again.

I haven’t got a clue if this is possible, but if the title of each filesystem could be changed by the user to suit their needs, and it was only a click or two to setup, it could be convenient enough for power users. And cut down on the complications.

ax · 2021-03-28 16:48:27

Bloke wrote #329547:

there’s one inode used for each physical file size, plus an extra inode for its /id subdir

Actually you can exceed the number of inodes when storing images, and I may be the only one in the forum who has ever experienced this “no more inodes” error message, where your device is only 10% full but your inode pool has been exhausted. This happened with a 4.1T device and about 600m images (from converting > 10,000 raw images to a pyramid image format, resulting in 50,000 tiled images or more per raw image in hierarchical subdirectories).

Therefore:

for practical purposes, counting inodes is irrelevant, unless you store pyramid images
in that case, a non-inode based file system should be used (I ended up with ReiserFS and btrfs)
hierarchical subdirectories can be useful for hosting images, including deep zoom images, and only in this rare instance an ext4 file system can be pushed to its limits.

Bloke · 2021-03-28 17:32:49

ax wrote #329549:

in that case, a non-inode based file system should be used (I ended up with ReiserFS and btrfs)

Good to know there are options, thanks. Assuming whoever runs into the problem is using a hosting platform (or doing it themselves) that allows them to freely choose a filesystem.

zero wrote #329548:

What if you designed the filesystem for people who use less than your (the developer’s) ideal maximum number of images? Then, for those with more images, create a plugin that can create filesystem 2 in parallel with the basic filesystem. And want more? Do it again.

Interesting. Wouldn’t even need a plugin, I expect. We could introduce a tiered structure like this:

/path/to/images/
-> vol0001/subdir/image files
-> vol0001/subdir/image files
-> vol0001/subdir/image files
...
-> vol0002/subdir/image files
-> vol0002/subdir/image files
-> vol0002/subdir/image files
...

If we limited the subdirs to the 2-character code system (or something of that ilk) then vol0001 could comfortably house up to, say, 50K image IDs. That’s going to cover 99.9% of users I expect. That has a maximum of 256 subdirs, so not onerous on FTP programs either.

Fully loaded, that’ll deliver an average of about 200 image IDs for every subdir. So if you had 5 image thumbnail sizes per ID, that’s 1000 files in each subdir. Not ideal, but certainly an FTP/file browser can handle that reasonably well.
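
A quick check of those averages. It assumes the 2-character code is hexadecimal, which is what a maximum of 256 subdirs implies:

```python
IDS_PER_VOLUME = 50_000
SUBDIRS_PER_VOLUME = 16 ** 2  # assumption: 2-character hexadecimal code
FILES_PER_ID = 5

ids_per_subdir = IDS_PER_VOLUME / SUBDIRS_PER_VOLUME
files_per_subdir = ids_per_subdir * FILES_PER_ID

print(f"{SUBDIRS_PER_VOLUME} subdirs, about {ids_per_subdir:.0f} IDs "
      f"and {files_per_subdir:.0f} files per subdir")
```

That comes out at roughly 195 IDs and just under 1000 files per subdir for a full volume, provided the codes are spread evenly.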

So if we internally hardcode the ‘click over’ point to be 50000 image IDs, then any ID value up to and including that is in vol0001. Image IDs 50001 – 100000 would appear in vol0002, and so forth.

Now, while this doesn’t limit the number of volumes that can be handled – it’s open-ended so you could run into problems down the line – you’ve really got to be going some to stress it:

With 10 volumes you can handle half a million image IDs.
With 100 volumes you can handle 5 million image IDs.
With 1000 volumes – where things might start to slow down if you FTP to /images – you can handle 50 million image IDs.

That’s not physical files, that’s image id values. Based on the average user having 5 files per image id (1x original, 1x working copy, 1x grid, and 2x custom sizes), that’s 250 million actual files.

If you were prepared to wait a bit for your initial directory read (once you choose your volume and enter it, the performance would return to normal), then with 10000 volumes, you could stretch your Txp to support half a billion image IDs – which is 2.5 billion image files. That exceeds, by a factor of ~4, the roughly 600 million files at which ax’s system ran out of inodes, so I’m kinda happy that we can outstrip the inode table without too much degradation at a shade over 2000 volumes.

Thus, for all intents and purposes, we can support unlimited images as you’re more likely to hit the filesystem limits before you reach Txp’s limit.
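
The capacity figures can be reproduced with a short Python sketch. The constants are the ones discussed above (50000 IDs per volume, 5 files per ID); the names are only illustrative:

```python
IDS_PER_VOLUME = 50_000
FILES_PER_ID = 5  # original, working copy, grid and two custom sizes


def capacity(volumes: int) -> tuple[int, int]:
    """Return (image IDs, physical files) supported by a number of volumes."""
    ids = volumes * IDS_PER_VOLUME
    return ids, ids * FILES_PER_ID


def volumes_for_files(files: int) -> int:
    """Number of volumes needed to hold a given count of physical files."""
    files_per_volume = IDS_PER_VOLUME * FILES_PER_ID
    return -(-files // files_per_volume)  # ceiling division


for volumes in (10, 100, 1_000, 10_000):
    ids, files = capacity(volumes)
    print(f"{volumes:>6} volumes: {ids:>13,} IDs, {files:>14,} files")
```

Under the same 5-files-per-ID assumption, the roughly 600 million files ax mentioned would correspond to about 2400 volumes.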

Maybe that’s the way to go. It’s certainly scalable. And if you want to know which volume an image is in, check its ID, divide by 50000, then round up. And within that volume, you can find its subdir by again taking the id and doing some operation on it (to be decided but maybe just the last two digits will be good enough – I’ll have to check and see what the distribution is like).
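
As a rough sketch of that lookup in Python: the volume rule is the divide-by-50000-and-round-up described above, while the subdir rule uses the "last two digits" candidate, which is still undecided. The function name and the `volNNNN/SS` output format are assumptions:

```python
IDS_PER_VOLUME = 50_000


def image_location(image_id: int) -> str:
    """Return 'volNNNN/SS' for an image ID.

    Volume: the ID divided by 50000, rounded up.
    Subdir: the last two digits of the ID (one candidate scheme only).
    """
    if image_id < 1:
        raise ValueError("image IDs start at 1")
    volume = (image_id - 1) // IDS_PER_VOLUME + 1  # integer round-up
    subdir = f"{image_id % 100:02d}"
    return f"vol{volume:04d}/{subdir}"
```

Because it depends only on the ID, the same calculation can be done by hand or in an import script outside Txp. With sequential IDs the last-two-digits rule fills its 100 subdirs evenly, at 500 IDs each for a full volume, rather than the 256 subdirs of the 2-character code.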

That’s certainly given me something to ponder, so thank you both.

Last edited by Bloke (2021-03-28 17:41:38)

giz · 2021-03-28 18:28:37

I know I’m repeating myself, but SLIR leaves the /images folder and the images tab alone.

Further, you don’t need to set up your various image sizes beforehand, as it is all handled on-demand via the code.

Bloke · 2021-03-28 18:46:25

I looked at SLIR. Wasn’t quite sure how to set it up so that flexible sizes could be generated by admins from the UI. It seems to need a set of sizes and transforms and returns the images.

Hooking it up to the front end tags would permit denial of service or file system flooding but if the generation was limited to logged-in users then it might work.

Also need to see how it plays with hard coded thumbs uploaded by users, and how it scales.

I suspect it will be impossible to import images en-masse using SLIR because the cache is internally managed. So you’d have to generate the sizes afterwards. That might make manual processes a little heavier.

I’ll give it a play in a separate branch and compare it to Intervention. Thanks for the nudge.

philwareham · 2021-03-28 19:45:37

SLIR is end of life. Even the author recommends Intervention Image-based alternatives (which is the library we are adopting). Also see github.com/ambroisemaupate/intervention-request.

However that is already provided by Glide (specifically Flysystem) that we’ve looked at too (another Intervention Image-based library), which IMO handles this at scale better.

Bloke · 2021-03-28 20:11:38

Ah yeah forgot about the state of SLIR dev. Okay scratch that.

giz · 2021-03-29 19:34:41

While I’m all for progress, I veer away from new and shiny software until it has advanced to a level where anyone can simply copy it into a directory and it works (and stays working¹).

SLIR won’t be end of life until it stops working. Maybe future changes to php will cause it to crap-out, but then it is in the same boat as everything else; if the software is useful, someone will patch it.

Bloke wrote #329552:

I looked at SLIR. Wasn’t quite sure how to set it up so that flexible sizes could be generated by admins from the UI. It seems to need a set of sizes and transforms and returns the images.

Hooking it up to the front end tags would permit denial of service or file system flooding but if the generation was limited to logged-in users then it might work.

Also need to see how it plays with hard coded thumbs uploaded by users, and how it scales.

I suspect it will be impossible to import images en-masse using SLIR because the cache is internally managed. So you’d have to generate the sizes afterwards. That might make manual processes a little heavier.

I’ll give it a play in a separate branch and compare it to Intervention. Thanks for the nudge.

It doesn’t need setting up after it’s installed; if you want to resize/crop an image, you supply the sizes to SLIR using the image’s url only when it’s needed.

Caching is automatic; active cached images are maintained, old/unused cached images are binned.

Not too sure what you mean by a hook-up to front-end tags / denial of service etc, or importing images en-masse… We appear to have completely different takes on how it might be used (or how any such tool could be useful in a Textpattern context).

User imports scores / thousands of new images into Textpattern. They change / replace a few hundred of the existing images. SLIR isn’t involved. Configuring sizes and crop ratios is the realm of the designer / developer, and to my mind has no place in the images tab – it belongs in the page templates and forms.

We can leave the whole art-direction / responsive image sizing groups / redundant image sizes / etc aspect out of the image tab entirely, and leave it for image originals (and a single optional thumbnail) only.

¹ I’ve used SLIR for 10 years on every website I’ve developed, some big, some small. I kid you not that it is the only! software I’ve used that has never skipped a beat, never required fixing, never required maintenance.

michaelkpate · 2021-03-29 20:47:27

giz wrote #329568:

Not too sure what you mean by a hook-up to front-end tags / denial of service etc, or importing images en-masse…

What steps will reproduce the problem?

Attacker selects a link of an image.

Attacker changes the set dimensions of the link so the script generates a new image instead of using cache. Making a program that spam http links with random generated dimension numbers in the link can easily be done in any programming language.

Server runs out of resources/crashes/gets disabled by host etc!

source: Security problem, attacker can drain server resources.

Bloke · 2021-03-29 21:56:39

giz wrote #329568:

We appear to have completely different takes on how it might be used (or how any such tool could be useful in a Textpattern context).

Not at all. We’re on the same page. I think images should be created when they’re needed and cached, and…

if you want to resize/crop an image, you supply the sizes to SLIR using the image’s url only when it’s needed.

… this is the bit that concerns me. Because when it’s needed in this context is when someone visits the front-end website, right? So you fashion a tag to fetch an image or three (at varying sizes) and the system goes to fetch them. It says, oops, I’ve only found one so I’ll create the other two and cache them. Is that right, or have I misunderstood?

If that’s the case…

Not too sure what you mean by a hook-up to front-end tags / denial of service etc

… then (as michaelkpate states above) all it takes is a bot/user with a grudge to use the DOM or write a script to loop over image sizes from 1×1 to (e.g.) 1280×1280 and every size in between for every image it can find on disk and fashion URLs to ask for all those images. Suddenly, your cache explodes and you have a bajillion image files.
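
To put a number on "a bajillion", here is the arithmetic for the 1×1 to 1280×1280 example in Python. The site size used in the comment is an arbitrary illustration:

```python
MAX_SIDE = 1280

# Every width from 1 to 1280 combined with every height from 1 to 1280.
variants_per_image = MAX_SIDE * MAX_SIDE


def cache_files(images_on_site: int) -> int:
    """Worst-case number of cached files if every size is requested once."""
    return images_on_site * variants_per_image


# For example, cache_files(100) is the worst case for a 100-image site.
print(f"{variants_per_image:,} possible sizes per image")
```

That is over 1.6 million possible sizes for every single source image, before counting any crop or filter parameters.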

That’s the reason I’m thinking about configuration and restricting image creation to either admin-side actions or to use the front-end tags exactly like that but only permit creation for logged-in users. Reduces the attack surface. I have no idea how easy that’ll be to achieve, but it’s something I want to solve.

If SLIR or whatever equivalent tool (Glide, etc) doesn’t work like that and is not exposed to everyone, then everything’s fine.

importing images en-masse…

If it was as simple as dumping images in Txp’s /images directory that’d be wonderful. But if you’re importing images from another (e.g. WordPress) system and you’ve exported the content, you need to maintain the integrity of article->image mapping(s). And Txp uses IDs, which are not known until you add them to the database.

So my only notion was that whatever file system layout we intend to use for images would benefit from being deterministic outside Textpattern. So you can rename your images and assign them an ID as you loop over your about-to-be-imported articles and put those files in the appropriate location so that, when you upload the images and import the XML files containing article data into the database, your mapping is already set up. Your images are exactly where Txp expects them and it can go about its business without you needing to manually go through and drag them into the interface.

After that, sure, if there’s auto-thumb creation and caching and on-the-fly image size creation, that’s great. Doesn’t matter. It’s all handled. It’s that initial mapping I want to be deterministic and easy to generate by hand so we’re not hamstrung by a fiendishly complicated array of directory names and numbers.

Our current system of dropping all images and their thumbs into a single directory is simply not good enough. When you get past a few thousand, file systems start to struggle. Any need to back up your file system to dump it to a disk requires shell access – FTP is painful to wait for it to read a dir of 15K+ images – and not everyone on their hosts has the luxury of a shell.

Configuring sizes and crop ratios is the realm of the designer / developer, and to my mind has no place in the images tab – it belongs in the page templates and forms.

No argument here. Your template is where you design what you want. The system needs to build images on the fly to fit your design. You upload a single, high res image and that is never touched or altered in any way. BUT Txp needs:

1x image that we use to display on the Image Edit panel when you click to tinker with the metadata.
1x image in order to display a thumbnail/grid.

In both cases, using the full-size is too slow. So we need to hard-code another size. We’ve tentatively chosen 1920 × 1440 for the larger image and something like 240×240 for the grid. Still deciding.
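
For a feel of what those boxes produce, here is a small Python sketch. It assumes the sizes are treated as bounding boxes (scale to fit, keep the aspect ratio, never upscale); whether the final behaviour is fit or crop has not been stated, so treat this as one possible interpretation:

```python
def fit_within(width: int, height: int,
               max_width: int, max_height: int) -> tuple[int, int]:
    """Scale (width, height) to fit inside the box, keeping the aspect ratio.

    Images already smaller than the box are returned unchanged.
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Under that assumption an 8000 × 6000 landscape original becomes 1920 × 1440, the same image in portrait becomes 1080 × 1440, and an 800 × 600 upload is left alone.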

If you view the image list/grid or click into the admin interface’s Edit panel, we can generate these on-the-fly via SLIR (or equivalent: in this case Intervention/Glide). But if you haven’t gone into the Edit panel – you’ve just uploaded the original and jumped to another panel – then we’d be forced to let some image auto-creation tool try and crunch down the sizes it needs from the original which may be 20MB and 8000 × 6000px resolution, or higher.

That’s not only going to hurt performance – even if it only has to do it once and cache all the smaller versions (unless the cache is flushed and it needs to do it again) – it’s a potential out-of-memory situation waiting to happen. It affects stability.
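
A back-of-envelope estimate of why the original is the risky one, in Python. It assumes the image is decoded to an uncompressed bitmap at 4 bytes per pixel; actual usage depends on the library and may be higher while a resize is in progress:

```python
def decoded_bytes(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """Approximate size of an uncompressed bitmap held in memory."""
    return width * height * bytes_per_pixel


original = decoded_bytes(8000, 6000)
working_copy = decoded_bytes(1920, 1440)
print(f"original: {original / 1e6:.0f} MB, "
      f"working copy: {working_copy / 1e6:.1f} MB")
```

On that assumption the 8000 × 6000px original needs about 192 MB just to sit in memory, against roughly 11 MB for a 1920 × 1440 image, which is more than many shared hosts allow a single PHP request.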

So, what we’re planning is that when you upload the original, we’re going to auto-create this working copy (for want of a better phrase) right then. Pass the original to whatever tool we choose to automatically make this image size so we know we have it cached. And then any future on-the-fly image creation can take place from this working copy. Much, much faster. Less chance of running out of memory when you view a gallery, for the price of a few extra seconds of wait time at upload.

If we can create both versions from the same in-memory copy using the library, great:

SomeLibrary::make('/path/to/massive_image')
    ->resize('1920x1440')
    ->save()
    ->resize('240x240')
    ->save();

That’d be ace. If it’s not possible to chain like that it might need to be done this way:

SomeLibrary::make('/path/to/massive_image')
    ->resize('1920x1440')
    ->save();
SomeLibrary::make('/path/to/working_copy_just_created')
    ->resize('240x240')
    ->save();

We’ll lose a bit of quality but hey it’s only a thumbnail for use on the back-end. People’ll get over it. But if we can keep the quality up then there’s no reason people can’t use this 240×240 in their own grid on the front-end. And the working copy too as part of their super-responsive desktop experience. The cache is there: use it.

Any new images, auto-created as needed by the design, are generated from the working copy. The original is pristine; there just as a reference in case you ever want to go back and recreate a high quality working copy at a different resolution. Maybe to rebase your responsive images when a new display density device is released.

We can leave the whole art-direction / responsive image sizing groups / redundant image sizes / etc aspect out of the image tab entirely, and leave it for image originals (and a single optional thumbnail) only.

Sure, we could. But if the cached image has been made already by some auto-creation tool, why not acknowledge its existence in the back-end? Why not show people which ones have been created? If the cache is under some other system’s control – in your case, SLIR – then this is impractical. We have no idea how that cache is managed and auto-purged so we only know the definite location of original+working copy+grid thumb.

But, if the tool we choose allows us to dictate the cache location – and Intervention can do this, if we don’t defer caching to its optional Laravel component – then we have an opportunity to offer art direction and image manipulation on any of the auto-generated thumbs for free. Pick one to operate on and manually upload a better pic, or crop it, or recolour it, or whatever. If we choose to permit this. We might not.

By not auto-flushing the cache, you might get a few sizes left behind that aren’t used. The UI can show you how many sizes there are for any image and you can choose to delete the ones you don’t want right there from the interface. A multi-edit option could allow you to delete all images matching a particular size. If there are too many and you have command line access, go nuts with find … -exec rm {} \;.

But if the file system can cope without grinding to a halt, leaving unused files around won’t hurt and you can tidy them up when you feel the need. If we leave the file system as it is today and start generating additional image sizes in the same directory – even in individual folders directly under /images – we are going to stress the server and slow down the site at some point in some applications. Not everyone, and it might only be a handful of sites, but it will affect them.

tl;dr: I’m totally with you on auto-generation. I just want to make sure that the tool we pick can grow and not hurt the performance we’ve strived hard to maintain when image-heavy sites reach 10-, 20-, 50-thousand images…

giz · 2021-03-30 00:23:29

Thanks; I’ve learnt a lot!

Wouldn’t it be possible to restrict calls to the script i.e. only accept urls generated from the site itself?
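
One general way to do that is to sign the URLs the site emits and refuse any request whose signature doesn't match. The sketch below is in Python and is not something SLIR is known to offer out of the box; the secret, parameter names and URL shape are all hypothetical:

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical per-site secret, never sent to the browser


def sign(path: str, width: int, height: int) -> str:
    """Return a hex signature covering the image path and requested size."""
    message = f"{path}|{width}x{height}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def signed_url(path: str, width: int, height: int) -> str:
    """Build the URL the site itself would emit in its templates."""
    return f"{path}?w={width}&h={height}&s={sign(path, width, height)}"


def is_valid(path: str, width: int, height: int, signature: str) -> bool:
    """Check a request's signature before generating anything."""
    return hmac.compare_digest(sign(path, width, height), signature)
```

Someone who edits the dimensions in a URL can't produce a matching signature without the secret, so the resize script can reject the request before doing any work. It doesn't stop repeated requests for sizes the site legitimately uses, but those are already cached.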

phiw13 · 2021-03-30 02:50:27

Bloke wrote #329575:

Lots of interesting thoughts, but this little tidbit worries me:

BUT Txp needs:

1x image that we use to display on the Image Edit panel when you click to tinker with the metadata.

1x image in order to display a thumbnail/grid.
In both cases, using the full-size is too slow. So we need to hard-code another size. We’ve tentatively chosen 1920 × 1440 for the larger image and something like 240×240 for the grid. Still deciding.

I do hope that at least the larger size image is configurable or optional (and just fetch / use the image the user has just uploaded).

landscape mode only, other types of images are cropped ? That does not work for me at all.
the size is larger than most images we ever upload
thumbnails: maybe a square cropped image can be used. For most sites I have worked on, a grid of images is almost completely useless, based on the mockup Phil once posted. More than the image, the metadata is important for us. Additionally, a square crop may cut too much valuable info out of the image for the thumb to be even marginally useful.

Textpattern CMS support forum
