Discussion:
Alternative to hash-bang
D'Arcy J.M. Cain
2014-07-18 20:22:59 UTC
There is currently a discussion in the PostgreSQL list about how to get
unix systems to read an SQL script and run it without choking on
invalid statements such as "#! usr/whatever/bin/psql" at the top of the
file. It occurred to me that the old hash-bang thing was a little
restrictive and perhaps it is time to add to the list of magic numbers
that run commands using alternative commenting characters. I am
thinking "--!!" for SQL scripts. The ';' character is also common so
perhaps ";!" as well. Those two, with "#!" probably cover 99% of the
possible commenting methods. Would doing this interfere with anything
else? Or should it be PostgreSQL's job to ignore the shebang if it sees
it at the start of its input?
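
To make the proposal concrete, here is a minimal sketch in Python of how a loader might recognise several magic prefixes instead of only "#!". It is an illustration under stated assumptions, not kernel code: the function name `parse_magic_line` and the constant `MAGIC_PREFIXES` are hypothetical, and the prefix list is simply the three candidates named above.

```python
# Hypothetical sketch: recognise several "magic" comment prefixes, not only "#!".
# The prefix list below is just the three candidates floated in this message.
MAGIC_PREFIXES = (b"#!", b"--!!", b";!")


def parse_magic_line(header: bytes):
    """Return (prefix, interpreter, argument) or None if no prefix matches."""
    # Only the first line of the file is examined.
    first_line = header.split(b"\n", 1)[0]
    for prefix in MAGIC_PREFIXES:
        if first_line.startswith(prefix):
            rest = first_line[len(prefix):].strip()
            if not rest:
                return None  # prefix present but no interpreter named
            parts = rest.split(None, 1)
            interpreter = parts[0].decode("utf-8", "replace")
            argument = parts[1].decode("utf-8", "replace") if len(parts) > 1 else None
            return prefix.decode("ascii"), interpreter, argument
    return None
```

With this sketch, a file starting with `--!! /usr/bin/psql -f` would be handed to `/usr/bin/psql` with `-f` as its argument, while the same line remains an ordinary comment to an SQL parser.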

By the way, "man magic" points to /usr/share/misc/magic which doesn't
actually exist. There is a binary file /usr/share/misc/magic.mgc which
I assume has the data in a format that reads faster but maybe the
source information should be put into the file mentioned by the man
page. If not then I think the man page should be fixed.
--
D'Arcy J.M. Cain <***@NetBSD.org>
http://www.NetBSD.org/ IM:***@Vex.Net
Joerg Sonnenberger
2014-07-18 20:31:53 UTC
Post by D'Arcy J.M. Cain
There is currently a discussion in the PostgreSQL list about how to get
unix systems to read an SQL script and run it without choking on
invalid statements such as "#! usr/whatever/bin/psql" at the top of the
file.
There are two typical approaches for this and IMO both work well enough.
The first one is to just teach the "interpreter" about the hash bang,
possibly under an option. That works well for things like psql. The
second option is to make it an actual shell script, which extracts the
content after a marker and calls the real tool.
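
The second approach can be sketched roughly as follows, in Python for brevity. The marker string `#__SQL__` and the name `extract_payload` are assumptions made for illustration; a real wrapper would choose its own marker and would then pass the extracted text to the real tool.

```python
MARKER = "#__SQL__"  # hypothetical marker line; any unique string would do


def extract_payload(script_text: str, marker: str = MARKER) -> str:
    """Return everything after the first line that equals `marker`."""
    lines = script_text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        # Compare without the line terminator so CRLF files also match.
        if line.rstrip("\r\n") == marker:
            return "".join(lines[i + 1:])
    raise ValueError("marker line not found")
```

Everything above the marker is wrapper logic that the real tool never sees; only the text below it is forwarded.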

Joerg
D'Arcy J.M. Cain
2014-07-18 23:14:58 UTC
On Fri, 18 Jul 2014 22:31:53 +0200
Post by Joerg Sonnenberger
There are two typical approaches for this and IMO both work well enough.
The first one is to just teach the "interpreter" about the hash bang,
That's the suggestion on the PG list. It led me to think about
generalizing it.
Post by Joerg Sonnenberger
possibly under an option. That works well for things like psql. The
second option is to make it an actual shell script, which extracts the
content after a marker and calls the real tool.
That works if you call it as a script but it doesn't work for "psql -f
filename" which would be necessary if you want to be portable under
Windows. On the other hand my suggestion makes it unportable to other
unices, but if we did it, it might catch on.
--
D'Arcy J.M. Cain <***@NetBSD.org>
http://www.NetBSD.org/ IM:***@Vex.Net
Greg Troxel
2014-07-18 23:23:17 UTC
Post by D'Arcy J.M. Cain
There is currently a discussion in the PostgreSQL list about how to get
unix systems to read an SQL script and run it without choking on
invalid statements such as "#! usr/whatever/bin/psql" at the top of the
file. It occurred to me that the old hash-bang thing was a little
restrictive and perhaps it is time to add to the list of magic numbers
that run commands using alternative commenting characters. I am
thinking "--!!" for SQL scripts. The ';' character is also common so
perhaps ";!" as well. Those two, with "#!" probably cover 99% of the
possible commenting methods. Would doing this interfere with anything
else? Or should it be PostgreSQL's job to ignore the shebang if it sees
it at the start of its input?
Once --!! happens, then there will be requests to add all sorts of
things, and this seems like a mess. Surely it's easy enough to teach
psql to ignore the first line if it starts with #!, because that's
obviously not valid sql. Then it would work everywhere.
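
A minimal sketch of that idea, written in Python rather than taken from psql's source: the function name `strip_shebang` is hypothetical, and the sketch only shows the "ignore the first line if it starts with #!" rule, not how psql reads its input.

```python
def strip_shebang(text: str) -> str:
    """Drop the first line if it starts with '#!'; otherwise return text unchanged."""
    if text.startswith("#!"):
        # partition() splits at the first newline; `rest` is empty if there is none.
        _, _, rest = text.partition("\n")
        return rest
    return text
```

Because the check applies only to the very first two bytes of the input, a "#!" anywhere else in the script is left for the SQL parser to deal with as before.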
D'Arcy J.M. Cain
2014-07-19 00:15:36 UTC
On Fri, 18 Jul 2014 19:23:17 -0400
Post by Greg Troxel
Once --!! happens, then there will be requests to add all sorts of
things, and this seems like a mess. Surely it's easy enough to teach
As I said, '#', ';' and '--' seem to cover most situations. Maybe '%'
to a much smaller extent. Can you think of many other possibilities?
Post by Greg Troxel
psql to ignore the first line if it starts with #!, because that's
obviously not valid sql. Then it would work everywhere.
True and that's a good solution for PG. Just thinking out loud here.
--
D'Arcy J.M. Cain <***@NetBSD.org>
http://www.NetBSD.org/ IM:***@Vex.Net
Alan Barrett
2014-07-19 06:57:20 UTC
Post by D'Arcy J.M. Cain
Post by Greg Troxel
Once --!! happens, then there will be requests to add all sorts of
things, and this seems like a mess. Surely it's easy enough to teach
As I said, '#', ';' and '--' seem to cover most situations. Maybe '%'
to a much smaller extent. Can you think of many other possibilities?
Are you proposing to make the kernel recognise several new
#!-equivalents?

It would be easy, and flexible, and not especially ugly, to make
the kernel scan for "#!" at some non-zero offset in the file. Say
at all offsets from 0 to 4, inclusive. Then you could write up
to 4 bytes to make the other interpreter think it's a comment,
followed by the traditional #! stuff. If you need %%-- to make
the other interpreter think the line is a comment, then write
%%--#!/usr/local/bin/interpreter.
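
The scan described above might look roughly like this. It is a Python illustration of the logic only, not kernel code; the name `find_shebang` and the `max_offset` parameter are assumptions, with the default of 4 taken from the offsets suggested here.

```python
def find_shebang(header: bytes, max_offset: int = 4):
    """Return (offset, interpreter) if "#!" starts at offset 0..max_offset
    on the first line, or None if it does not."""
    first_line = header.split(b"\n", 1)[0]
    for offset in range(max_offset + 1):
        if first_line[offset:offset + 2] == b"#!":
            rest = first_line[offset + 2:].strip()
            if not rest:
                return None  # "#!" found but no interpreter path follows
            return offset, rest.split()[0].decode("utf-8", "replace")
    return None
```

For the example line `%%--#!/usr/local/bin/interpreter`, the "#!" sits at offset 4, so the sketch reports `/usr/local/bin/interpreter`; a fifth comment byte in front would push it out of range.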

--apb (Alan Barrett)
Steffen Nurpmeso
2014-07-19 13:36:48 UTC
Alan Barrett <***@cequrux.com> wrote:
|On Fri, 18 Jul 2014, D'Arcy J.M. Cain wrote:
|>Greg Troxel <***@ir.bbn.com> wrote:
|>> Once --!! happens, then there will be requests to add all sorts of
|>> things, and this seems like a mess. Surely it's easy enough to teach
|>
|>As I said, '#', ';' and '--' seem to cover most situations. Maybe '%'
|>to a much smaller extent. Can you think of many other possibilities?
|
|Are you proposing to make the kernel recognise several new
|#!-equivalents?
|
|It would be easy, and flexible, and not especially ugly, to make
|the kernel scan for "#!" at some non-zero offset in the file. Say
|at all offsets from 0 to 4, inclusive. Then you could write up

How about supporting the more-and-more common Unicode
Byte-Order-Mark for UTF-8 encoded shell scripts? Even though
i personally don't like it, it does have its merits and it will
become more and more common; which is why i think supporting them
would be an enhancement, the sooner the better; anyway better than:

?0[***@nhead tmp]$ cat t.sh
#!/bin/sh
echo this is a shell script with BOM
?0[***@nhead tmp]$ s-hex t.sh
00000000 ef bb bf 23 21 2f 62 69 6e 2f 73 68 0a 65 63 68 |#!/bin/sh.ech|
00000010 6f 20 74 68 69 73 20 69 73 20 61 20 73 68 65 6c |o this is a shel|
00000020 6c 20 73 63 72 69 70 74 20 77 69 74 68 20 42 4f |l script with BO|
00000030 4d 0a |M.|
00000032
?0[***@nhead tmp]$ ./t.sh
./t.sh: #!/bin/sh: not found
this is a shell script with BOM
?0[***@nhead tmp]$

I think the attached diffs should do it, but i don't know (repo
from end of june, not even compile tested; but i think it is good)
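
For readers who want to see the intended behaviour, here is a small Python sketch of "skip a UTF-8 BOM, then look for #!". It is not a reconstruction of the diffs mentioned above; the function name `shebang_after_bom` is hypothetical, and the BOM bytes are the `ef bb bf` shown in the hex dump.

```python
import codecs


def shebang_after_bom(header: bytes):
    """Return the interpreter path if the data starts with "#!",
    optionally preceded by a UTF-8 BOM; otherwise return None."""
    if header.startswith(codecs.BOM_UTF8):  # b"\xef\xbb\xbf"
        header = header[len(codecs.BOM_UTF8):]
    if not header.startswith(b"#!"):
        return None
    first_line = header.split(b"\n", 1)[0]
    fields = first_line[2:].split()
    return fields[0].decode("utf-8", "replace") if fields else None
```

Applied to the bytes of `t.sh` above, the sketch would find `/bin/sh` instead of treating the whole first line as a command name.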

--steffen
Thomas Klausner
2014-07-19 13:41:45 UTC
Post by Steffen Nurpmeso
How about supporting the more-and-more common Unicode
Byte-Order-Mark for UTF-8 encoded shell scripts?
As Wikipedia says:

The Unicode Standard permits the BOM in UTF-8,[2] but does not require
or recommend its use.[3]

[2] "The Unicode Standard 5.0, Chapter 2:General Structure" (PDF). p. 36. Retrieved 2009-03-29. "Table 2-4. The Seven Unicode Encoding Schemes"

[3] "The Unicode Standard 5.0, Chapter 2:General Structure" (PDF). p. 36. Retrieved 2008-11-30. "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"

Thomas
Steffen Nurpmeso
2014-07-19 15:00:29 UTC
Hello,

Thomas Klausner <***@NetBSD.org> wrote:
|On Sat, Jul 19, 2014 at 03:36:48PM +0200, Steffen Nurpmeso wrote:
|> How about supporting the more-and-more common Unicode
|> Byte-Order-Mark for UTF-8 encoded shell scripts?
|
|As Wikipedia says:
|
|The Unicode Standard permits the BOM in UTF-8,[2] but does not require
|or recommend its use.[3]
|
|[2] "The Unicode Standard 5.0, Chapter 2:General Structure" \
|(PDF). p. 36. Retrieved 2009-03-29. "Table 2-4. The Seven \
|Unicode Encoding Schemes"
|
|[3] "The Unicode Standard 5.0, Chapter 2:General Structure" \
|(PDF). p. 36. Retrieved 2008-11-30. "Use of a BOM is neither \
|required nor recommended for UTF-8, but may be encountered \
|in contexts where UTF-8 data is converted from other encoding \

Yes. Using BOM won't work in `$ cat f1 f2 > f3' etc. which is why
i personally don't use them -- but i'm in a privileged situation
regarding my text etc. files.

|forms that use a BOM or where the BOM is used as a UTF-8 signature"

So this last part is the one that i think about. Many text
editors will get that right and/or use it right away, as do some
scripting languages, like, say Thomas Klausner, like perl(1):

=item C<BOM>-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode C<BOM> (UTF-16LE, UTF16-BE,
or UTF-8), or if the script looks like non-C<BOM>-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(C<BOM>less UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

And because of this last part again i finally come to the conclusion
that the UTF-8 BOM will become a vivid part of the future, because
it carries information of a file's encoding along with the file as
a part of the encoding itself.

The real question is: what should be done with BOMs in `$ cat f1
f2 > f3', they cannot simply become stripped off?

--steffen
David Holland
2014-07-21 07:57:40 UTC
Post by Steffen Nurpmeso
The real question is: what should be done with BOMs in `$ cat f1
f2 > f3', they cannot simply become stripped off?
What if f1 and f2 are non-Unicode files that happen to begin with
these bytes?
--
David A. Holland
***@netbsd.org
Justin Cormack
2014-07-21 09:47:49 UTC
Post by Steffen Nurpmeso
And because of this last part again i finally come to the conclusion
that the UTF-8 BOM will become a vivid part of the future, because
it carries information of a file's encoding along with the file as
a part of the encoding itself.
UTF8 BOMs are only really used on Windows due to its UTF16 heritage. I have
never seen them used on a Unix system. That is probably why Perl added
support. That should not mean the use should be encouraged.
Post by Steffen Nurpmeso
The real question is: what should be done with BOMs in `$ cat f1
f2 > f3', they cannot simply become stripped off?
Write a utfcat command?
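
One way such a command could behave, sketched in Python: keep a BOM at the very start of the output if the first input has one, and drop the BOM from every later input. The name `utfcat` comes from the suggestion above, but the policy shown is an assumption. As noted earlier in the thread, the sketch cannot tell a real BOM from non-Unicode data that happens to begin with the same three bytes.

```python
import codecs


def utfcat(chunks):
    """Concatenate byte strings, keeping at most one leading UTF-8 BOM."""
    out = bytearray()
    for index, chunk in enumerate(chunks):
        # Strip the BOM from every input except the first one.
        if index > 0 and chunk.startswith(codecs.BOM_UTF8):
            chunk = chunk[len(codecs.BOM_UTF8):]
        out += chunk
    return bytes(out)
```

A plain `cat f1 f2 > f3` would instead leave the second file's BOM in the middle of `f3`.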
Steffen Nurpmeso
2014-07-21 12:36:16 UTC
Hello,

Justin Cormack <***@specialbusservice.com> wrote:
|On Jul 19, 2014 4:00 PM, "Steffen Nurpmeso" <***@yandex.com> wrote:
|> And because of this last part again i finally come to the conclusion
|> that the UTF-8 BOM will become a vivid part of the future, because
|> it carries information of a file's encoding along with the file as
|> a part of the encoding itself.
|
|UTF8 BOMs are only really used on Windows due to its UTF16 heritage. I have
|never seen them used on a Unix system. That is probably why Perl added
|support. That should not mean the use should be encouraged.

Maybe. Yes. But with respect to the first two points i had to learn
that some Unix systems (AIX) also use UTF-16; i don't know how hard
IBM, as (i think) a paying core member of the POSIX standard, will
try to push UTF-16 into the standard once that finally moves forward
towards true support for the languages of the world; maybe not at
all (their ICU library seems to improve UTF-8 support; still,
i think the core is UTF-16).

|> The real question is: what should be done with BOMs in `$ cat f1
|> f2 > f3', they cannot simply become stripped off?
|
|Write a utfcat command?

Tja. A locale modifier like POSIX.UTF-8@BOM wouldn't cause the
right thing. Martin Dürst of W3C wrote a few years ago:

Yes exactly. In the RFC 2070 and HTML4 time-frame, nobody that I know
was thinking about a BOM for UTF-8. Only later BOMs at the start of
HTML4 started to turn up, and browser makers were surprised. Roughly the
same happened for XML. Early XML parsers didn't handle the BOM.

When Windows notepad started to use the BOM to distinguish between UTF-8
and "ANSI" (the local system legacy encoding), this BOM leaked into
HTML, and was difficult to stop. So XML got updated, and parsers started
to get updated, too.

...

The problem with the BOM in UTF-8 is that it can be quite helpful (for
quickly distinguishing between UTF-8 and legacy-encoded files) and quite
damaging (for programs that use the Unix/Linux model of text
processing), and that's why it creates so much controversy.

--steffen

Marc Balmer
2014-07-19 07:45:09 UTC
Post by D'Arcy J.M. Cain
On Fri, 18 Jul 2014 19:23:17 -0400
Post by Greg Troxel
Once --!! happens, then there will be requests to add all sorts of
things, and this seems like a mess. Surely it's easy enough to teach
As I said, '#', ';' and '--' seem to cover most situations. Maybe '%'
to a much smaller extent. Can you think of many other possibilities?
Post by Greg Troxel
psql to ignore the first line if it starts with #!, because that's
obviously not valid sql. Then it would work everywhere.
True and that's a good solution for PG. Just thinking out loud here.
For the time being, there is an (ugly) workaround, which might be non-portable: use the '<<' here-document redirection, but omit the end marker in the file. The shell will then read input until the end of the file:

#!/bin/sh

psql <<EOT
\?
\c mydb
select now();