Git LFS… use or avoid?

What is LFS?

Git LFS is an extension to standard GIT meant to deal with “Large File Storage”.

Standard, raw GIT deals well with any kind of file of practically any size. If, however, the user performs the simple and common operation:

git clone

then the entire history is downloaded and stored locally. The user usually performs this action to use or to continue development of the existing data, which means he/she is mostly interested in the current state of the work. The entire history is usually not necessary, but who will bother with shallow cloning when a standard “clone” is the easiest thing to do?

Now imagine that You decided to use GIT to store some JPG images You work on. One image is about 4 MB, and You have a history 100 revisions deep. This gives about 400 MB of repository size, since JPG files are so heavily compressed and data-scattered that GIT has a hard time producing an efficient diff-compression.

And here LFS comes into play. It delays the actual download of those images until Your user runs:

git checkout branch/commit

Thanks to that approach You may save a lot of bandwidth on Your GIT server.
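To give a rough idea, enabling it usually boils down to a few commands like the sketch below (the *.jpg pattern is only an example):

  git lfs install                # install the LFS hooks for Your account
  git lfs track "*.jpg"          # tell LFS which files it should take over
  git add .gitattributes         # the tracked patterns are stored in .gitattributes
  git add photo.jpg
  git commit -m "add photo via LFS"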

How is it done?

Commit/checkout

Basically, when You enable the LFS extension, then each time You commit a file matching a pattern You told LFS to take care of, it will replace that file in the commit with a simple text notice: “I, the LFS, took care of it and stored it as XXXX”. Then it will copy the file somewhere inside the .git folder of Your repository.

The checkout does the reverse: it detects the text notice and, using it, replaces it with the actual file.
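For reference, such a text notice (the “pointer file”) looks roughly like the lines below; the hash and the size are made up for illustration:

  version https://git-lfs.github.com/spec/v1
  oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
  size 4194304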

Push/pull

If You do a git push, then it pushes the commits the usual way and then sends, using a dedicated protocol, those stored files to the server. The server puts them in a structure called a “file storage” and matches them with the XXXX from the text notice mentioned above.

When You pull, nothing is done unless You specifically tell LFS to download its files. Instead, when git checkout can’t find the files locally, it downloads them from the server using the LFS protocol.
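If You do want the files up front, the explicit commands look like this:

  git lfs fetch          # download LFS objects for the current ref only
  git lfs fetch --all    # download LFS objects for all refs
  git lfs pull           # fetch the objects and check them out in one step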

Benefits

At first I was very happy with it. It was doing its job and looked like a good way of keeping CAD files in GIT.

Until I realized some nasty side effects.

Side effects

File-format standardization

I maintain a small company GIT server. I took great care to run it on LVM mirrors and to do daily diff-backups to an external drive. So I am well protected against hardware failure (mirroring) and against sabotage (daily backups).

But what if the server has to go down for a long time? Or what if the server software loses compatibility with the ever-changing environment and I won’t be able to keep it running?

I took care to check how the GIT repositories are stored by this server. They are kept as plain bare repos. This means that if the server goes down, I can simply copy those bare repositories to any other server, or even to a SAMBA file server. It will have restricted functionality, but I will be able to use them without having the server software running.

This is because GIT is defined at the file level. Anyone who writes a GIT server will surely be tempted to use libgit2/JGit to do the hard work and won’t try to re-invent the wheel.
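To make it concrete, rescuing a plain bare repository is as trivial as the sketch below (the paths are only examples):

  # copy the bare repository off the dying server
  rsync -a /srv/git/project_x.git /mnt/backup/project_x.git

  # any dumb file share is enough to keep working, no server software needed
  git clone /mnt/backup/project_x.git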

Protocol standardization

With LFS it is different.

Putting aside the vague, loose and very imprecise specification, LFS is specified only at the protocol level. The implementation of the local file storage is not a part of the specification, and there is absolutely no specification of how the server should lay out the “file storage” and its relationship with a repository on the server file system.

This means that if the server dies to the point that I can’t make it handle git clone --mirror, then I won’t be able to transform the server-side format into the local format. And without that I won’t be able to push it to another server.

With bare GIT repositories I can just copy them to another server implementation. With Git LFS I can copy them only to another instance of the same implementation.

This scared me a bit.

Centralized instead of distributed

Another bad thing is the fact that once You start using LFS, You turn Your GIT system from a distributed data storage into a centralized one.

I first realized that when the network in my company died and I could still work with all the plain GIT repositories, but couldn’t do much with those which were using LFS.

Then it occurred to me:

With plain GIT, each time a user does a git clone, I gain a free-of-charge complete*) backup of the repository data. Even though my server is well set up, it may always get attacked. Deletion of data will be easily detected, but if an attacker changes the repository content, it may go undetected. Until, of course, someone who has a clone does a git push. GIT will complain to them, I will have a chance to detect the problem, and I will have an untainted backup on some of the workstations.
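A hedged sketch of how such a divergence shows up from any workstation clone (the branch name is only an example):

  git fetch origin
  git log --oneline master..origin/master   # commits which exist only on the server
  git log --oneline origin/master..master   # commits which exist only in the clone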

Abusive to server

The other fact I wasn’t aware of when I started using LFS for CAD files was that there won’t be any efficient diff-compression. LFS just takes the files and sends them to the server. The server I have isn’t very smart in that area and does not do the diff-compression by itself. So the server-side repositories quickly ballooned to surprising sizes.

Abusive to history

Initially, when I was reading about “intercepting files on commit”, I was, being a coder, under the impression that it is done in the right place. That is, after computing the commit hash and before the file content is directed to the diff-compression routines which turn it into GIT blobs.

Unfortunately it is done before computing the commit hash.

Where is the nastiness in it?

In those two commands:

git lfs migrate import
git lfs migrate export

They are clearly the recommended ways of turning Your existing repository into an LFS one and, backwards, turning an LFS-managed one into a plain GIT-managed one.
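In their typical form they look roughly like this (the *.jpg pattern is only an example; --everything makes them rewrite all refs):

  git lfs migrate import --everything --include="*.jpg"
  git lfs migrate export --everything --include="*.jpg"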

Being careless in that manner, I thought: “OK, so I can switch back to plain GIT if I don’t like LFS.”

This is true only as long as there are no clones.

The fact that LFS acts before the commit hash computation means that migrating to or from it will rewrite the whole history of the repository, changing the hashes of each and every commit. This is horrible, since after such an operation each and every clone out in the world will be able neither to git pull nor to git push.

No file deletion

At a certain moment I thought: “All right, I made a mistake over-using LFS. I can get control over all the clones in the company, tell users to push their changes, migrate off LFS and tell them to clone again from scratch.”

So I did it, destroying the consistency of the history between the server and the clones:

git lfs migrate export ...
git push --force

And guess what? The file storage size on the server did not change. Nothing. Zero. All files were left exactly as they were before, exactly as if LFS were still in use. Hmm… maybe a bug on the server side?

So I inspected the LFS protocol. I even asked the LFS guys. They confirmed: there is no way for LFS to delete objects from the file storage.

Ehm…. Say what?!

Gladly, this server tracks which file comes from which repository, and if I delete a repository, its files in the file storage are also deleted. The problem is that the same happens with all the tickets, discussions and other data which are bound to the repository. I can hardly call it a work-around.

Summary

I am very sad to say it, but if I had to say one thing about Git LFS, it would be: Avoid it at all costs!

Be very, very careful, consider why You need it, and balance it against all the above side effects.
Remember, once You enter the LFS path it will be rather hard to abandon it completely. You can stop using it for new commits, but You will have to keep it around for accessing the history.

For me, the one and only reason to use it is to save network bandwidth. But there are better ways of doing that. Read about shallow clones, single-branch clones and blobless clones.
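A quick sketch of those three options (the URL is a placeholder):

  git clone --depth 1 <url>                        # shallow clone: just the latest snapshot
  git clone --single-branch --branch master <url>  # one branch only, with its history
  git clone --filter=blob:none <url>               # blobless clone: file contents fetched on demand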


*) Well… not a truly complete backup of all server data. Tickets, issues, forums etc. are usually kept outside the repository structure. But the actual work itself is preserved.

How not to: Git & fatal: detected dubious ownership

Today I was hit by this message:

“fatal: detected dubious ownership”

The reason was that I was logged in as one user and had cloned the repo as a different user. The new GIT thinks this is not OK and throws the “fatal” at every script I tried.

Of course, as GIT always does, it pointed me to the solution. Set the global configuration variable:

safe.directory

to either * if I don’t like this functionality, or to the specific folder in which I do allow more than one user to work.
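For example, for a single shared folder (the path is only an illustration):

  git config --global --add safe.directory /home/hombre/my_projects/project_x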

Fine. Shocking and disturbing but fine.

It had shown me how to solve a problem, right?

Except it is all wrong!

Security issue solved?

The primary idea behind this functionality is to defend against the following mode of attack:

Let’s say a user “hombre” has the following folder structure:

 /--+
    |
    + home 
        +
        +- hombre
           |
           + my_projects +
                         |
                         + project_x
                                +-- .git
                                +--- notes
                                +--- libs
                                        +--- mylibrary_A                                     

The user has the project_x repository and the .git folder inside it. The .git may be quite a tricky beast and contain filter configuration which may allow some code to be run at almost every git command.

The git command, when run, looks in the current folder and upwards for a .git and loads configuration from there.

So if one can inject a fake, malicious .git in there:

 /--+
    |
    + home 
        +
        +- hombre
           |
           + my_projects +
                         |
                         + project_x
                                +-- .git 
                                +--- notes
                                +--- libs
                                        +--- .git
                                        +--- mylibrary_A                                     

then if user “hombre” types:

  cd ~/my_projects/project_x/libs/mylibrary_A
  git status

then GIT will happily look into that injected .git and do what it is told there.
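What can it be told there? A hedged illustration of what such an injected .git/config might contain (the script path is made up; core.fsmonitor was the vector from the original advisory, and filter drivers work in a similar spirit):

  [core]
      fsmonitor = /tmp/payload.sh      ; executed on nearly every git command
  [filter "innocent"]
      smudge = /tmp/payload.sh %f      ; executed when matching files are checked out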

Alternatively one may just:

 /--+
    |
    + home 
        +--- .git
        +
        +- hombre
           |
           + my_projects +
                         |
                         + project_x
                                +-- git        <--- notice, I just removed the dot from the name
                                +--- notes
                                +--- libs
                                        +--- mylibrary_A                                     

and the effect will be very similar.

This is a serious issue, they say, which allows making “hombre” run code decided by somebody else…

Except it is bullshit

First of all, how the hell would an attacker inject that .git?!

One must have write access to “hombre”’s folders. And if a user got such write access, then, dear me, either it is a legit user who is in an appropriate group, or Your system is so compromised that anyone can do anything. In the first case it is a fully legitimate situation; in the second, You are boned, dead and stinking.

Just manage access rights correctly, silly puss!

You can disable it if You don’t like it…

The effect is that most users will just:

 Windows version:
  git config --global safe.directory *
   or
  git config --system safe.directory *

 Bash version:
  git config --global safe.directory "*"

and voilà.

If someone was soooooo lazy as to not set up security and allowed untrusted users to manipulate trusted users’ data, then that person may keep this functionality enabled. Nobody else will ever need it.

No problem, the devs say, You can always disable it.

Yes sure.

Money, money, money….

Figuring out what was going on was not very easy, especially since GIT displayed cryptic user information in the form of a Windows SID instead of a user name. I needed about half an hour to make sure that there was nothing really wrong with my system and that I had in fact created the repositories using a different account than the one I am working on now. This is a company-owned machine and we switched from one account system to another about a year ago.

I spent 30 minutes on this, which gives 6 Euros I did not earn and close to 10 Euros of employer costs.

And here comes the sad thing. I myself needed half an hour. In my company we have about 40 people who use GIT, and all of them will have similar problems, because all of them were subject to the account switch last year.

In fact, in 99% of company-owned systems the owner of shared resources is not the “hombre” user but a “group owner”. This is how You manage access in a cost-efficient way. And the new, more secure GIT will complain.

I needed 30 minutes. Some of those 40 people will also need 30 minutes, some of them will catch up faster because they will ask me, and some will have to call the IT department, because we use GIT not only for code but also for document version tracking, and those users really just click some batch scripts which handle guided commits. The IT action won’t close in less than one work hour, counting IT personnel, intermediaries and the user, plus a significant delay and stall, sometimes a full working day.

The total cost will be around: 20×5E + 15×10E + 5×20E = 350 Euros.

Now let us do some more math.

How many GIT users are there all over the world? This is very popular software. The count of public repos is around a million, so we can be on the safe side and say that we have about 1 million GIT users. Let’s say half of them are corporate, and half of that half will have a problem.

We have 250’000 problems.

Each problem will take on average about 30 minutes to solve. I think I am about average, so this is a good guess, I suppose.

The cost of this “security feature” is then:

2'500'000 Euro

This is a raw cost. If we are, however, talking about corporate users, we should also account for a “lost profit” cost. While I was looking up a solution I wasn’t doing my job, and due to that my work did not produce the expected profit.

How much is that “lost profit”?

Well… my company is not making financial losses, which means it must earn at least a bit more than those 10 Euros for each half hour of my work. Considering taxes etc., I think that 25% profit is an absolute minimum.

So the total amount of money lost all over the world was:

3'125'000 Euro

Three millions.

Now a question to ask: how much money was saved by this feature? Where and by whom? Which system was actually effectively protected by it? Did that system cost more than 3 million Euros?

Summary

The devs did agree that it was a “disruptive change”. A “disruptive change” is a change which forces some action from the user just because the user updated GIT.

Please, please, please, always carefully consider the total global cost when introducing a “disruptive change”. It has already cost my company 350 Euros. I personally lost half an hour. I will have to lose much, much more, because I will have to:

  • update company guides about it;
  • update training documents;
  • review all scripts on the git server to check whether they are used in a way affected by this “security” feature;
  • train my subordinates about it.

I think I will need to invest at least 40 more work hours just to be sure that this function is disabled on all workstations.

Yes, disabled.

Because we do implement proper IT security.