Unions and default values in Apache Avro serialization and deserialization

The initial Avro schema (schema/user.avsc) defines a User record with only a name field.

{
  "namespace": "com.bawi.avro.model",
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "name",
      "type": "string"
    }
  ]
}

The Maven pom.xml declares the Avro dependency:

        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.8.1</version>
        </dependency>

so we can serialize the User data in Java to disk, into the user.avro file:

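        import java.io.File;
        import org.apache.avro.Schema;
        import org.apache.avro.file.DataFileWriter;
        import org.apache.avro.generic.GenericData;
        import org.apache.avro.generic.GenericDatumWriter;
        import org.apache.avro.generic.GenericRecord;
        import org.apache.avro.io.DatumWriter;

        Schema schema = new Schema.Parser().parse(new File("schema/user.avsc"));
        File avroFile = new File("target/user.avro");
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alyssa");
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        // try-with-resources closes the writer (and flushes the file) even if append fails
        try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter)) {
            dataFileWriter.create(schema, avroFile);
            dataFileWriter.append(user);
        }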

We can read (deserialize) the User from disk either in Java:

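        import org.apache.avro.file.DataFileReader;
        import org.apache.avro.generic.GenericDatumReader;
        import org.apache.avro.io.DatumReader;

        DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
        try (DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(avroFile, datumReader)) {
            GenericRecord user = null;
            while (dataFileReader.hasNext()) {
                user = dataFileReader.next(user); // reuse the record instance between iterations
                System.out.println(user);
            }
        }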

or by using the avro-tools jar, which Maven downloads when it is declared as a test dependency:

        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-tools</artifactId>
            <version>1.8.1</version>
            <scope>test</scope>
        </dependency>

and running it with the ‘tojson’ argument:

me@MacBook:~/dev/my-projects/my-avro$ java -jar /Users/me/.m2/repository/org/apache/avro/avro-tools/1.8.1/avro-tools-1.8.1.jar tojson target/user.avro
{"name":"Alyssa"}

Then we add a new favorite_number field to the schema:

{
  "namespace": "com.bawi.avro.model",
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "favorite_number",
      "type": "int"
    }
  ]
}

If we then run the deserialization Java code on the existing data in user.avro, but against the new schema, we get:

Exception in thread "main" org.apache.avro.AvroTypeException: Found com.bawi.avro.model.User, expecting com.bawi.avro.model.User, missing required field favorite_number

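since favorite_number does not exist in the Avro file and the new schema gives the reader no default value to fall back on.

Declaring the field as a union of int and null alone does not get rid of the error above, because a field without a default is still required.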

The solution is to add a default value with a union for favorite_number e.g.:

    {
      "name": "favorite_number",
      "type": [
        "null",
        "int"
      ],
      "default": null
    }

to get: {"name": "Alyssa", "favorite_number": null}
or add

    {
      "name": "favorite_number",
      "type": "int",
      "default": 0
    }

to get: {"name": "Alyssa", "favorite_number": 0}

Please note that placing int as the first type of a union while using null as the default value, such as:

    {
      "name": "favorite_number",
      "type": [
        "int",
        "null"
      ],
      "default": null
    }

gives an error:

Exception in thread "main" org.apache.avro.AvroTypeException: Non-numeric default value for int: null

Likewise, the error

Exception in thread "main" org.apache.avro.AvroTypeException: Non-null default value for null type: 0

is raised when null is the first type of the union but the default value is 0:

    {
      "name": "favorite_number",
      "type": [
        "null",
        "int"
      ],
      "default": 0
    }

The default value of a union field must match the first type listed in the union, as described in https://avro.apache.org/docs/1.7.7/spec.html#Unions

curl download file from ftp with user and password

me@MacBook:~/Downloads$ curl -OLv -u myusername:mypassword ftp://myftphost/myfile
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying myftpip...
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Connected to myftphost (myftpip) port 21 (#0)
< 220 FTP server ready
> USER myusername
< 331 Password required for myusername
> PASS mypassword
< 230 User myusername logged in
> PWD
< 257 "/" is the current directory
* Entry path is '/'
> EPSV
* Connect data stream passively
* ftp_perform ends with SECONDARY: 0
< 229 Entering Extended Passive Mode (|||44363|)
* Trying myftpip...
* Connecting to myftpip (myftpip) port 44363
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0* Connected to myftphost (myftpip) port 21 (#0)
> TYPE I
< 200 Type set to I
> SIZE myfile
< 213 173938503
> RETR myfile
< 150 Opening BINARY mode data connection for myfile (173938503 bytes)
* Maxdownload = -1
* Getting file with size: 173938503
{ [1448 bytes data]
99 165M 99 164M 0 0 3881k 0 0:00:43 0:00:43 --:--:-- 3925k* Remembering we are in dir ""
< 226 Transfer complete
100 165M 100 165M 0 0 3885k 0 0:00:43 0:00:43 --:--:-- 4004k
* Connection #0 to host myftphost left intact

My text editors for xml and json – sublime and notepad++

Notepad++ – install the 32-bit version, since the 64-bit version does not support Plugin Manager and some other plugins. Then go to Plugins/Plugin Manager and install XML Tools. For JSON, go to ‘?’ / Get more plugins, search for ‘Json Viewer’, download the zip file, extract it and place the dll file in the C:/Program Files (x86)/Notepad++/plugins dir

Sublime Text 3: ensure that ‘Package Control’ is installed: go to https://packagecontrol.io/installation, open View -> console, copy and paste the installation text, and restart Sublime. Then go to Preferences/Package Control, type ‘install package’, hit enter, type xpath to filter the list of plugins and choose ‘xpath’ and ‘Indent-xml’

grep egrep sed match replace part of the text

url=http://svn.dev.mycompany.com/svn/myproject/branches/mybranch
svn log $url -r {2016-12-01}:{2016-12-30} --search Bartosz
------------------------------------------------------------------------
r324 | bartosz@mycompany.com | 2016-12-14 17:11:44 +0100 (Wed, 14 Dec 2016) | 1 line

created mybranch
------------------------------------------------------------------------

expected output:

http://svn.dev.mycompany.com/svn/myproject/branches/mybranch/?p=324 | bartosz@mycompany.com | 2016-12-14 17:11:44 +0100 (Wed, 14 Dec 2016)

command:

url=http://svn.dev.mycompany.com/svn/myproject/branches/mybranch
svn log $url -r {2016-12-01}:{2016-12-30} --search Bartosz | egrep '^r[0-9]+' | sed "s|\(^r\)\([0-9]*\)|$url/?p=\2|" | sed 's/ | [0-9]* line.*$//'

where

egrep '^r[0-9]+'

is equivalent to grep -E (extended regular expressions) and matches lines that begin with r followed by a revision number, e.g. r123

sed "s|\(^r\)\([0-9]*\)|$url/?p=\2|"

– uses ‘|’ instead of ‘/’ as the sed delimiter (the url itself contains ‘/’), keeps only the revision number (dropping the leading ‘r’) and prepends the url: ‘r123’ -> ‘http://svn.dev.mycompany.com/svn/myproject/branches/mybranch/?p=123’

sed 's/ | [0-9]* line.*$//'

– removes last part of message e.g.: ‘ | 1 line’
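
The whole pipeline can be tried without access to an SVN server. The sketch below feeds the sample log entry from this section through the same filtering steps. Two things differ from the original command and are assumptions of this sketch: grep -E stands in for egrep, and the first sed uses a single capture group, since the leading r never needs to be referenced. The variable names sample_log and result are illustrative only.

```shell
url=http://svn.dev.mycompany.com/svn/myproject/branches/mybranch

# Sample `svn log` output copied from the listing above, so no repository is needed.
sample_log='------------------------------------------------------------------------
r324 | bartosz@mycompany.com | 2016-12-14 17:11:44 +0100 (Wed, 14 Dec 2016) | 1 line

created mybranch
------------------------------------------------------------------------'

# 1. keep only the revision header lines (those starting with r<digits>)
# 2. turn the leading r<digits> into <url>/?p=<digits>
# 3. drop the trailing " | N line(s)" column
result=$(printf '%s\n' "$sample_log" \
  | grep -E '^r[0-9]+' \
  | sed "s|^r\([0-9]*\)|$url/?p=\1|" \
  | sed 's/ | [0-9]* line.*$//')

printf '%s\n' "$result"
```

Because the url is interpolated into the sed expression, this only works while the url contains no ‘|’, ‘&’ or backslash characters.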

Docker quickstart cloudera

me@MacBook:~$ docker pull cloudera/quickstart:latest
latest: Pulling from cloudera/quickstart
1d00652ce734: Pull complete
Digest: sha256:f91bee4cdfa2c92ea3652929a22f729d4d13fc838b00f120e630f91c941acb63
Status: Downloaded newer image for cloudera/quickstart:latest

docker images

docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 8888:8888 -p 80:80 -p 7180:7180 --name quickstart.cloudera -d 4239cd2958c6 /usr/bin/docker-quickstart

me@MacBook:~/dev/env$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
da0050c427c3 4239cd2958c6 "/usr/bin/docker-quic" 3 hours ago Up 3 hours 0.0.0.0:80->80/tcp, 0.0.0.0:7180->7180/tcp, 0.0.0.0:8888->8888/tcp quickstart.cloudera

docker attach quickstart.cloudera

ctrl + p, then ctrl + q (detach the tty)

docker start quickstart.cloudera

me@MacBook:~$ docker stop -t 120 quickstart.cloudera
quickstart.cloudera

Backup rsync

1. Initial source folder structure:

me@MacBook:~/tmp$ find source
source
source/.DS_Store
source/a
source/b
source/dev
source/dev/.DS_Store
source/dev/x
source/dev/y.orig.vhd
source/dev/y.vhd

2. Let's back up the source folder to the destination folder using rsync:

me@MacBook:~/tmp$ rsync -abvP  --backup-dir=backup_`date +%Y-%m-%d--%H-%M-%S` --include=*.orig.vhd --exclude={/backup_*,.DS_Store,*.vhd} --delete source/ destination/
building file list ... 
6 files to consider
created directory destination
./
a
           1 100%    0.00kB/s    0:00:00 (xfer#1, to-check=4/6)
b
           1 100%    0.98kB/s    0:00:00 (xfer#2, to-check=3/6)
dev/
dev/x
           1 100%    0.98kB/s    0:00:00 (xfer#3, to-check=1/6)
dev/y.orig.vhd
           1 100%    0.98kB/s    0:00:00 (xfer#4, to-check=0/6)

sent 343 bytes  received 120 bytes  926.00 bytes/sec
total size is 4  speedup is 0.01

explanation:

--include=*.orig.vhd

do not exclude *.orig.vhd files in any source (sub)directory (since the include comes before the exclude, the inclusion matches first and wins over the *.vhd exclusion)

--exclude={/backup_*,.DS_Store,*.vhd}

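copy all files except backup_* entries at the top level of the transfer (the leading / anchors the pattern there, which also keeps --delete from removing earlier backup folders in the destination), and except all .DS_Store and *.vhd files in any (sub)directories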
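
The include-before-exclude ordering described above can be checked with a throwaway run in a temporary directory. This is a minimal sketch: the file names mirror the listing in step 1, the demo_status variable is purely illustrative, and the run is skipped when rsync is not installed.

```shell
# Throwaway check of rsync filter ordering; everything lives under a temp dir.
demo_status=skipped
if command -v rsync >/dev/null 2>&1; then
  work=$(mktemp -d)
  mkdir -p "$work/source/dev"
  touch "$work/source/a" "$work/source/.DS_Store" \
        "$work/source/dev/y.vhd" "$work/source/dev/y.orig.vhd"

  # The include rule is listed first, so *.orig.vhd is accepted
  # before the broader *.vhd exclude is ever consulted.
  rsync -a --include='*.orig.vhd' --exclude='.DS_Store' --exclude='*.vhd' \
        "$work/source/" "$work/destination/"

  if [ -e "$work/destination/a" ] &&
     [ -e "$work/destination/dev/y.orig.vhd" ] &&
     [ ! -e "$work/destination/dev/y.vhd" ] &&
     [ ! -e "$work/destination/.DS_Store" ]; then
    demo_status=ok
  else
    demo_status=unexpected
  fi
  rm -rf "$work"
fi
echo "$demo_status"
```

Swapping the order of the include and the *.vhd exclude should make y.orig.vhd disappear from the destination as well, since the first matching rule decides.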

3. The destination directory content:

me@MacBook:~/tmp$ find destination
destination
destination/a
destination/b
destination/dev
destination/dev/x
destination/dev/y.orig.vhd

4. Let's modify one file, delete another, and then run the same rsync command:

me@MacBook:~/tmp$ echo "2" > source/a 
me@MacBook:~/tmp$ rm source/dev/x
me@MacBook:~/tmp$ rsync -abvP  --backup-dir=backup_`date +%Y-%m-%d--%H-%M-%S` --include=*.orig.vhd --exclude={/backup_*,.DS_Store,*.vhd} --delete source/ destination/
building file list ... 
5 files to consider
deleting dev/x
./
a
           2 100%    0.00kB/s    0:00:00 (xfer#1, to-check=3/5)
dev/

sent 195 bytes  received 54 bytes  498.00 bytes/sec
total size is 4  speedup is 0.02

explanation:

--backup-dir=backup_`date +%Y-%m-%d--%H-%M-%S`

creates a backup folder in destination dir with previous versions of modified or deleted files

--delete

deletes files in the destination folder if they were deleted in the source dir

5. The destination folder will then contain backup_2016-10-29--12-26-06 with the previous version of a and with dev/x as it was before deletion:

me@MacBook:~/tmp$ find destination
destination
destination/a
destination/b
destination/backup_2016-10-29--12-26-06
destination/backup_2016-10-29--12-26-06/a
destination/backup_2016-10-29--12-26-06/dev
destination/backup_2016-10-29--12-26-06/dev/x
destination/dev
destination/dev/y.orig.vhd
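
The backup-and-delete behaviour of steps 4 and 5 can be reproduced in isolation. The sketch below is a simplified variant under stated assumptions: it uses a temporary directory, drops the .vhd filters, wraps the rsync call in a hypothetical run_backup helper, and is skipped when rsync is not installed. The modified file is given a different size on purpose, so rsync notices the change even when both runs happen within the same second.

```shell
backup_status=skipped
if command -v rsync >/dev/null 2>&1; then
  work=$(mktemp -d)
  mkdir -p "$work/source/dev"
  echo 1 > "$work/source/a"
  echo 1 > "$work/source/dev/x"

  # Hypothetical helper: sync source to destination, moving replaced or
  # deleted files into the given backup folder inside the destination.
  run_backup() {
    rsync -ab --backup-dir="$1" --exclude='/backup_*' --delete \
          "$work/source/" "$work/destination/"
  }

  run_backup "backup_$(date +%Y-%m-%d--%H-%M-%S)"   # first run: plain copy

  echo "version 2" > "$work/source/a"               # modify one file
  rm "$work/source/dev/x"                           # delete another
  second_backup="backup_$(date +%Y-%m-%d--%H-%M-%S)"
  run_backup "$second_backup"

  if [ "$(cat "$work/destination/a")" = "version 2" ] &&
     [ "$(cat "$work/destination/$second_backup/a")" = "1" ] &&
     [ -e "$work/destination/$second_backup/dev/x" ] &&
     [ ! -e "$work/destination/dev/x" ]; then
    backup_status=ok
  else
    backup_status=unexpected
  fi
  rm -rf "$work"
fi
echo "$backup_status"
```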

Mac shortcuts

cmd+W – close window
cmd+Q – quit application
cmd+tab – switch applications

alt + cmd + space – open finder
cmd + shift + m – toggle zoom: maximize/minimize window
cmd + m – minimize application to dock icon (un-minimize – select app with cmd+tab, while pressing cmd, start pressing alt and release cmd)
F5 for Reload This Page – System Preferences/Keyboard/Shortcuts/App Shortcuts: click + to add the Google Chrome application with menu title Reload This Page and shortcut F5
F11 – show desktop
F3 – show mission control (all windows) or 3 fingers swipe up
ctrl + down arrow – show Application Windows (App Exposé) or 3 fingers swipe down
ctrl + up arrow – show Mission control
ctrl + F8 + down arrow + down arrow + enter – lock screen (Utilities/KeyChain Access/Preferences/General: check: Show keychain status in menu bar)

Terminal:
alt + right/left arrow – move one word forward / backward
ctrl + A / E (or fn + shift + left/right arrow) – move to the beginning / end of line

TextEdit/Browser (alt and cmd):
alt + right/left arrow – move cursor one word forward/backward
cmd + right/left arrow – move to the end/beginning of line (also ctrl + right/left when the Mission Control shortcuts are disabled or remapped)
cmd + up/down arrow – move cursor to the beginning/end of the document
fn + left/right arrow – go to home/end
fn + up/down arrow – page up/down (cursor does not move)
shift + alt + right/left – select one word from the cursor to the right/left
shift + cmd + right/left – select from the cursor to the end/beginning of line
cmd + c / v / x – copy / paste / cut
ctrl + tab – move between tabs in chrome
cmd + alt + shift + v – paste formatted text but adjust to the format of target text

Vim (fn and ctrl; cmd is not used, so the same keys work on Windows/Linux):
fn + left/right – move to the beginning/end of line (fn since there are no home/end keys)
ctrl + right/left – move to the one word forward/backward (with disabled/remapped Mission Control)
cmd + shift + v – paste selected text (needs to be selected via mouse)

My Intellij (fn and ctrl; cmd is not used, so the same keys work on Windows/Linux):
fn + left/right – move to the beginning/end of line (fn since there are no home/end keys)
ctrl + right/left – move one word forward/backward (with disabled/remapped Mission Control)
shift + fn + left/right – select from the cursor to the beginning/end of line
shift + ctrl + right/left – select one word forward/backward (with disabled/remapped Mission Control)
ctrl + c / v / x – copy / paste / cut

Setup

    1. Turn off AutoCorrect: Apple Menu > System Preferences > Keyboard > Text and turn Autocorrect off by unchecking Correct spelling automatically
    2. Add ssh autocompletion for bash by appending the following to ~/.bash_profile or ~/.bashrc:
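_complete_ssh_hosts ()
{
    COMPREPLY=()
    local cur="${COMP_WORDS[COMP_CWORD]}"
    # Candidates are the host aliases declared on "Host" lines in ~/.ssh/config.
    # Alternative source (disabled): the host column of ~/.ssh/known_hosts, e.g.
    #   cut -f 1 -d ' ' ~/.ssh/known_hosts | sed -e 's/,.*//' | grep -v '^#' | uniq | grep -v '\['
    local comp_ssh_hosts
    comp_ssh_hosts=$(grep "^Host " ~/.ssh/config | awk '{print $2}')
    # Keep only the candidates that start with the word being completed.
    COMPREPLY=( $(compgen -W "${comp_ssh_hosts}" -- "$cur") )
    return 0
}
complete -F _complete_ssh_hosts ssh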
    3. Terminal prompt structure and colors:
export PS1="\[\033[01;32m\]\u@MacBook\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ "
export CLICOLOR=1
export LSCOLORS=ExFxBxDxCxegedabagacad
alias ls='ls -Gh'
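
The host-extraction step of the ssh completion function in item 2 can be tried on its own, without touching the real ~/.ssh/config. The sketch below runs the same grep and awk filter against a temporary file; the host names devbox and staging and their settings are made up for illustration.

```shell
# Write a small, hypothetical ssh config to a temp file.
config=$(mktemp)
cat > "$config" <<'EOF'
Host devbox
    HostName 10.0.0.5
Host staging
    HostName staging.example.com
    User deploy
EOF

# Same filter as in the completion function: take the first name on each "Host" line.
hosts=$(grep "^Host " "$config" | awk '{print $2}')
rm -f "$config"

printf '%s\n' "$hosts"
```

Only the first alias of each Host line is picked up, so an entry such as "Host a b" would offer just a for completion.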

brew install svn (append to .bash_profile: export PATH=/usr/local/bin:${PATH})
brew install maven
brew install cntlm

Settings -> Keyboard/Text: disable "Use smart quotes and dashes"

Finder -> Preferences/Advanced/Show all filename extensions