Images and the file system layout from 4.9.0

philwareham · 2021-03-28 08:29:04

And regarding file directory – again why over complicate. We have unique image IDs already so let’s keep the original image (and legacy thumb) in the images directory, and for each image id also have a sub directory in there labelled with image ID and within that a temp folder for the work images plus the image at various sizes – either loose or again within subdirectories.

That keeps a level of back compat too.

phiw13 · 2021-03-28 09:56:02

Bloke wrote #329513:

Yeah. It’s not ideal. I’m open to better ideas.

I am not sure what to suggest ;-). Something ID + name based, along the lines suggested by Vienuolis might be more viable – ID being the the ID of of the main, source, images and all derivatives (and alt. formats such as .webp or .avif) are link to that one ID and – on the filesystem – stored in the same location.

one thing I don’t like is a date-based system like the one used by big brother CMS-that-cannot-be-named. That gives “labyrinth” a bad name and is, as noted by both Jacob and you above, rather difficult to use over time.

that said, I never had to deal with (many) thousands of images. At best, a couple of thousands, stored in various locations on the filesystem.

Sure, if we can. ID values are attractive because they’re multi user compliant and easier to administer when names might change. But I do like names if we can come up with some suitable scheme.

I am thinking “name” here not so much for administrative management (DB side) as the user friendly part, similar as we have “name” and “title” for categories / sections/ … Another benefit might reside in less difficult management of multiple formats of the same image.

(regarding processing images through tools like GD
Noted. You mean in general, from a quality standpoint? Or performance?

Filesize tend to be much larger (it does not help that on the web you are already limited in working with lossy images, Safari on my Mac is the only one that can handle TIFF files). Quality, as in for certain types of images you have lots of unsharpness along edges and border, odd colour shifts,… For the qua;lity aspects, it of course depends what type of images you handle. Landscape, large groups of people, buildings, general views are easier than things like product images (and object in front of a blurry background), screenshots and line-art (maps, plans,…) portraits…. And as it is those are the images I deal with 90% of the time

I’m conscious to make it easy to create/replace images with your own. Need to give this more thought.

The tools as available in your smd_thumbnails are very useful for me. As it is, when first uploading an image, I don’t bother anymore with turning the needed profiles, only the “defaut” profile is “on“. After that I can turn on some profiles and manually manage (upload) for each individual image.

etc · 2021-03-28 10:31:30

Bloke wrote #329532:

If the filesystem uses dates, people won’t be able to do that. They won’t know which date Txp is going to use: The file creation date? Its modified date? The date the file is uploaded? The server date or local system date?

Oh, y/m/d is not meant to be part of data, just a human-readable path. It could be a/b/c as well. No need to replace it on update, y/m/d/id becomes kinda immutable image id.

But again, I have no experience with large image collections management.

colak · 2021-03-28 10:50:01

Would it be prudent to give people an option? We are currently discussing about which method we should set in stone, but some may prefer ids, y, y/m, y/m/d. I’m just thinking that some options like that, may prepare the ground for other ones which we cannot think of just yet.

Bloke · 2021-03-28 10:59:22

colak wrote #329533:

Sorry to be a pest, but with operate on just that thumb, you mean using the web interface where the Browse, Reset, Upload button will be available.

Yes. A dropdown containing:

original
1920 x 1440 [<<default when the panel loads]
800 × 600
400 × 300
240 × 240
…

You pick one, it loads it from the filesystem into the viewing area. You do something to it, e.g. replace it (or crop or colorize/adjust, maybe: not sure yet).

Having the tools available for each size/version of the image would be excellent for the picture tag

Yes. For art direction. We have to weigh up:

The convenience and simplicity – some might say rigidity – of being able to only change the working copy (the main image cloned from the original) so all its thumbs always stay in sync with it, whatever you do. Thus art direction has to be done by either a) uploading a completely separate image – with a different id – that you’ve pre-processed offline, or b) you manually upload pre-prepared (cropped/whatever) thumbs to replace particular thumbs in the current image ID set.
The flexibility – but potentially more complex UX – of being able to operate on any thumb independently of the working copy via the UI. This permits one (or more) particular sizes of the same image ID to be visually distinct, without having to do it offline first.
Something in between these two extremes. I’m conscious that if we force regeneration of thumbs when you change the working copy, anyone who’s gone to the trouble of uploading their own thumb versions is going to be cross having to re-upload their handiwork.

The first option makes <picture> tag construction harder because you’re dealing with (potentially) two or three image IDs to construct one responsive set of images. So we either need to rethink the <txp:images> tag to be able to handle this better or introduce a new tag specifically for <picture> which makes it easy to mix and match multiple images in one HTML tag.

The second gives you the freedom to crop the 400×300 version tight on the subject and when you construct your <picture> tag you can do it with a single <txp:images> wrapper.

To be fair, you could supply a list of id values to the <txp:images> tag if we forbade editing on any thumb, but your container becomes ugly: you need to detect which image you’re working on in the list, and do something different with it.

Here’s a quick mod to Phil’s mockup to show what I’m thinking. I don’t particularly like that I’ve moved the effects panel to the bottom but I’m not a designer, and it does have the benefit of pushing the image itself up the screen a little.

I’m thinking that if you alter the size dropdown to load a different image in for editing, your current operations / undo history are discarded (maybe a warning is shown). So it’s not super complicated to keep a history, as it’s only for one image at a time.

Haven’t thought about scaling options yet, so ignore that bit. The only other tweak I made to the Filters panel is that I wondered if it might be nice to combine blur/sharpen into a single slider. Center is ‘Nothing’, drag left to blur, right to sharpen. They’re mutually opposable operations, right? You wouldn’t blur and then sharpen independently (or vice versa), would you?

Last edited by Bloke (2021-03-28 12:14:40)

Bloke · 2021-03-28 11:35:58

philwareham wrote #329536:

And regarding file directory – again why over complicate.

Because there are people with thousands and thousands of images. Wedding photographers like Ross Harvey have 8k+ photos in that one dir. That’s 16k files including thumbs. It’s painful to manage images if you ever have to dive in and manage them via a file browser like FTP to the host (which I’ve had to do when sites are hit by hackers). And I know of systems that have tens of thousands of images – double that when including core thumbs.

Although there are many sites that don’t use a vast number of images, we kind of owe it to photobloggers and image-heavy sites to at least let their systems breathe and make it easier to manage. So I’m looking for sane, scalable storage methods.

Since we’re already planning moving public assets to a single folder I figured now might be a good time to do it for images too. Maybe it’s a step too much too soon? I’m fine with leaving the /images directory in place and just augmenting it with subdirs for multiple thumbs. But I do want to get away from ‘everything in one dir’. On that note…

for each image id also have a sub directory in there labelled with image ID and within that a temp folder for the work images plus the image at various sizes – either loose or again within subdirectories.

… doing that swaps 20000 image files for 20000 directories. Is that any better from a management perspective? Marginally, but it’s still s-l-o-w, as it has to read all those dir names when you enter the /images directory.

Honestly, I don’t know which way to go. But I think we have to do something.

Regarding temp images, I was intending to have this file structure:

/path/to/images
-> _tmp
-> subdir/image files
-> subdir/image files
-> subdir/image files
...

That well-known _tmp dir is the one that houses your undo states. Each time you make an adjustment, a new file with the ID and the transformation applied is written there and we link to it from the UI so you can see the effects in the browser. When you Apply, that version is copied over, replacing the one in your real subdir, and the temp history files for all images matching that id are trashed.

My reasoning was that if we have a _tmp dir in each image’s subdir we have to muck about with creating them if they don’t exist. And if you ever want to clean up and remove them all, it’s simpler to just go into the single, well-known _tmp dir and blat all the accumulated images than it is to hunt for them in every single dir splattered across the file system.

Last edited by Bloke (2021-03-28 12:19:03)

Bloke · 2021-03-28 12:07:45

phiw13 wrote #329537:

one thing I don’t like is a date-based system… that gives “labyrinth” a bad name

That’s my experience with it too. Although I started out wondering if we could do it better, I’ve not yet found a way to use date-based dir structures in a sane way that helps someone who’s tearing their hair out shouting where the hell is this particular picture stored?

I am thinking “name” here not so much for administrative management (DB side) as the user friendly part

Using name would be nice if we could solve the duplication issue. i.e. what happens if you replace an image file with one where you’ve changed the filename. We already have that (awful, imo) disconnect on the Files panel where you can upload sales_presentation_v1.pdf and serve it; then, later, click to edit, choose File Replace and swap it for sales_presentation_v2.pdf. The file is uploaded and renamed to sales_presentation_v1.pdf so you can still access it from the same name. That’s:

Good that you don’t get a broken download link.
Bad that the version indicated in the file name doesn’t match the actual file content.
Bad that your browser might not realise it’s a new version and continue to serve the old one – and there’s no neat way of ‘force refreshing’ a file_download. The usual SHIFT+REFRESH doesn’t always work because it’s a redirect.

Adopting that kind of thing with images fills me with trepidation. We don’t have the refresh issue so much (because images are usually served directly and are more easily refreshed) but having bad filenames that don’t quite represent their content because you’ve decided to improve its name is worse than using a faceless ID+metadata, from an SEO standpoint: potentially misleading.

Filesize tend to be much larger… Quality, as in for certain types of images you have lots of unsharpness along edges and border, odd colour shifts

Yeah if you’re working with fine detail images with sharp edges, auto-creation can be quite horrible. That’s why I’d like to retain the ability to replace individual thumbs easily from the UI. And permit you to swap them with versions of different type.

Perhaps the ‘filter’ controls are a step too far in this regard. Maybe you only see those if you choose the main ‘working copy’? If one is eschewing ImageMagick/GD completely anyway and creating thumbs by hand, you won’t be using the automatic conversion anyway during upload. Adopting a workflow like you suggest with smd_thumbnail makes perfect sense here. Create the thumb by default (for the List/Grid view), then switch on the profile(s) you want to edit, click into the image, upload your thumbs, then turn the profiles off after until you need them again.

To support this workflow here in core:

Leave everything at their defaults. You’ll get an original, a working copy and a grid thumb created by the system.
Click to edit.
Use some kind of ‘add thumbs’ feature to indicate you want some additional thumbnails for this size image. That adds them to the dropdown so you can select them, then…
… click browse to load an image into that slot. Repeat for each thumb.

Bonus points could be had here if we can find ways to intelligently solve these use cases:

Ability to add more than one thumb at once, e.g. from the edit panel, permit selection of multiple images. Perhaps you could select the entry at the bottom of the dropdown that says “New sizes” and then Browse. Select 1, 2, 3… whatever number of images you want as thumbs and upload them. Txp creates a thumb from each of them and uses the dimensions of each image it reads as the sizes available in the dropdown, adding them automatically. I like that.
From the list panel, the ability to upload more than one version of the same image. At the moment, each image of a multi-set is assumed to be a discrete image. If you’ve gone to the trouble of prepping a set of them externally, it’d be nice to have a radio/checkbox alongside the Upload button that says “Treat these pics as one image set”. So if you pick 4 images when you upload, instead of creating 4 discrete new IDs, it creates one ID and uses each of the selected images as different size thumbnails (plus auto-creating working image and grid for Txp’s use if the dimensions don’t match any of the incoming files). Again, as above, the sizes can be read from the incoming files to create the ‘profiles’ for that image on the fly.

In the latter case, we could actually make it so that if you indicate all the images you’re uploading are a set, it switches you straight into the Edit panel after upload. So you can add the metadata alt/caption immediately. But if you state all the images are independent, it leaves you on the List panel as it does now.

File storage aside, I think there are some nice wins to be had here to allow people who are serious about not using ImageMagick/GD to bypass them and use their own for a superfast workflow.

Whether this is enabled in core or we expose the image system in a better way via callbacks so a plugin can augment it is something we can decide as we iterate things.

Last edited by Bloke (2021-03-28 12:25:42)

philwareham · 2021-03-28 12:36:01

Might be some useful ideas here, if you’ve not already read it.

I think each image ID is going to need it’s own directory to house it’s images and the temp directory for it’s work files. How we structure the directories above that is the conundrum. Looks like a 32,000 sub directory limit per directory level needs to be heeded so definitely a way of breaking the directories down into small chunks has to be solved.

Bloke · 2021-03-28 15:01:25

Good read, thanks Phil. I like that one about using the last 4 digits/chars and splitting on that. MIght be simpler than a completely random hash, although the first 999 files may need special dispensation. Haven’t checked his code yet and seen how it scales. I’ll investigate this week.

philwareham wrote #329546:

each image ID is going to need it’s own directory to house it’s images and the temp directory for it’s work files.

One subdir per id: yes, we could do that. Or a few images in one. Prefixing each image with its id means files can co-exist happily in a dir. And they’re grouped naturally when sorting so you can see all the files related to a single image ID in FTP/file browsers, which is a bonus.

Keeping individual temp dirs is attractive from a compartmentalisation viewpoint but does make it harder (potentially) to clean up by hand if you start getting files floating around due to crashed browser sessions or using the browser’s Back button rather than ‘Cancel’ (which can clean up after itself).

I don’t think a single well-known temp dir is too big a deal. You’ll only usually work on one image at a time anyway; even in a multi-user environment there’ll only be a few of you potentially working on images at any one time so the dir won’t get crowded. Just like with regular images, prefixing temp files with the ID being worked on will make it simple to locate ones that belong to the same editing session and blat them if necessary.

One thing we do need to be mindful of is that, under *nix, adding a temp dir to each subdir uses one extra inode for something that is only occasionally used. And inodes are a precious commodity. It’s entirely possible (even today, I believe: but I’m willing to be proved wrong) to run out of inodes despite there being a tonne of storage space left on your drive.

The same can be said for creating a subdir of its own id. We could do that, but for every id, there’s one inode used for each physical file size, plus an extra inode for its /id subdir. Back of a napkin example:

8000 image IDs.
5 responsive sizes, including system thumb and system working copy = 40000 inodes.
Number of subdirectories to house these files if using the hash method I mentioned above, with a three-char code: 3528 (=43528 inodes).
One extra subdir per id to house its own files: 8000 (= 51528 inodes).
One extra subdir per id for dedicated temp files: 8000 (= 59528 inodes).

That’s an extra 8K or 16K inodes used up for no (or very little) additional storage value. So, imo, the flatter the structure the better.

How we structure the directories above that is the conundrum.

Absolutely. That’s what I’m trying to solve in as sane and simple manner possible to limit the maximum dir space (so there’s no capability for there to be tens of thousands of dirs directly under /images) and keep a reasonable number of files within each of those. And keep that structure as flat as possible.

Last edited by Bloke (2021-03-28 15:04:25)

zero · 2021-03-28 16:43:30

Bloke wrote #329544:

Because there are people with thousands and thousands of images.

What if you designed the filesystem for people who use less than your (the developer’s) ideal maximum number of images? Then, for those with more images, create a plugin that can create filesystem 2 in parallel with the basic filesystem. And want more? Do it again.

I haven’t got a clue if this is possible, but if the title of each filesystem could be changed by the user to suit their needs, and it was only a click or two to setup, it could be convenient enough for power users. And cut down on the complications.

ax · 2021-03-28 16:48:27

Bloke wrote #329547:

there’s one inode used for each physical file size, plus an extra inode for its /id subdir

Actually you can exceed the number of inodes when storing images, and I may be the only one in the forum who ever experienced this “no more inodes” error message when your device is only 10% filled, but your inodes pool has been exhausted. This happened with an 4.1T device and about 600m images (from converting > 10.000 raw images to a pyramid image format, resulting in 50.000 tiled images or more per raw image in hierarchical subdirectories).

Therefore:

for practical purposes, counting inodes is irrelevant, unless you store pyramid images
in that case, a non-inode based file system should be used (I ended up with ReiserFS and btrfs)
hierarchical subdirectories can be useful for hosting images, including deep zoom images, and only in this rare instance an ext4 file system can be pushed to its limits.

Bloke · 2021-03-28 17:32:49

ax wrote #329549:

in that case, a non-inode based file system should be used (I ended up with ReiserFS and btrfs)

Good to know there are options, thanks. Assuming whoever runs into the problem is using a hosting platform (or doing it themselves) that allows them to freely choose a filesystem.

zero wrote #329548:

What if you designed the filesystem for people who use less than your (the developer’s) ideal maximum number of images? Then, for those with more images, create a plugin that can create filesystem 2 in parallel with the basic filesystem. And want more? Do it again.

Interesting. Wouldn’t even need a plugin, I expect. We could introduce a tiered structure like this:

/path/to/images/
-> vol0001/subdir/image files
-> vol0001/subdir/image files
-> vol0001/subdir/image files
...
-> vol0002/subdir/image files
-> vol0002/subdir/image files
-> vol0002/subdir/image files
...

If we limited the subdirs to the 2-character code system (or something of that ilk) then vol0001 could comfortably house up to, say, 50K image IDs. That’s going to cover 99.9% of users I expect. That has a maximum of 256 subdirs, so not onerous on FTP programs either.

Fully loaded, that’ll deliver an average of about 200 image IDs for every subdir. So if you had 5 image thumbnail sizes per ID, that’s 1000 files in each subdir. Not ideal, but certainly an FTP/file browser can handle that reasonably well.

So if we internally hardcode the ‘click over’ point to be 50000 image IDs, then any ID value lower than that is in vol0001. From image ID 50001 – 100000, they’d appear in vol0002, and so forth.

Now, while this doesn’t limit the number of volumes that can be handled – it’s open-ended so you could run into problem down the line – you’ve really got to be going some to stress it:

With 10 volumes you can handle half a million image IDs.
With 100 volumes you can handle 5 million image ID.
With 1000 volumes – where things might start to slow down if you FTP to /images – you can handle 50 million image IDs.

That’s not physical files, that’s image id values. Based on the average user having 5 files per image id (1x original, 1x working copy, 1x grid, and 2x custom sizes), that’s 250 million actual files.

If you were prepared to wait a bit for your initial directory read (once you choose your volume and enter it, the performance would return to normal), then with 10000 volumes, you could stretch your Txp to support half a billion image IDs – which is 2.5 billion image files. That exceeds the system (by a factor of ~4) that ax found was where the inode count started to struggle, so I’m kinda happy that we can outstrip the inode table without too much degradation at a shade over 2000 volumes.

Thus, for all intents and purposes, we can support unlimited images as you’re more likely to hit the filesystem limits before you reach Txp’s limit.

Maybe that’s the way to go. It’s certainly scalable. And if you want to know which volume an image is in, check its ID, divide by 50000, then round up. And within that volume, you can find its subdir by again taking the id and doing some operation on it (to be decided but maybe just the last two digits will be good enough – I’ll have to check and see what the distribution is like).

That’s certainly given me something to ponder, so thank you both.

Last edited by Bloke (2021-03-28 17:41:38)

Textpattern CMS

Textpattern CMS support forum

#25 2021-03-28 08:29:04

Re: Images and the file system layout from 4.9.0

#26 2021-03-28 09:56:02

Re: Images and the file system layout from 4.9.0

Bloke wrote #329513:

#27 2021-03-28 10:31:30

Re: Images and the file system layout from 4.9.0

Bloke wrote #329532:

#28 2021-03-28 10:50:01

Re: Images and the file system layout from 4.9.0

#29 2021-03-28 10:59:22

Re: Images and the file system layout from 4.9.0

colak wrote #329533:

#30 2021-03-28 11:35:58

Re: Images and the file system layout from 4.9.0

philwareham wrote #329536:

#31 2021-03-28 12:07:45

Re: Images and the file system layout from 4.9.0

phiw13 wrote #329537:

#32 2021-03-28 12:36:01

Re: Images and the file system layout from 4.9.0

#33 2021-03-28 15:01:25

Re: Images and the file system layout from 4.9.0

philwareham wrote #329546:

#34 2021-03-28 16:43:30

Re: Images and the file system layout from 4.9.0

Bloke wrote #329544:

#35 2021-03-28 16:48:27

Re: Images and the file system layout from 4.9.0

Bloke wrote #329547:

#36 2021-03-28 17:32:49

Re: Images and the file system layout from 4.9.0

ax wrote #329549:

zero wrote #329548:

Board footer