ETL project that collects data from the Brazilian Electoral Data Repository and imports it into a MySQL database, and then the data is placed in Elasticsearch using the ELK stack. Electoral years available: 2010, 2012, 2014, 2016, 2018, 2020. The full import of every year totals 28.842.859 million lines in the database.
This is a raw data processing (ETL) project for standardizing information for a MySQL Database. The Brazilian Electoral History data is made available by TSE on the website TSE Electoral Data Repository.
The process of importing the data into an SQL database aims to facilitate the development of other systems as well as for statistical analysis.
This project uses the following technologies:
Five tables are created in the database called (electoral_history) to store the data. Are they:
Table | Lines | Description |
---|---|---|
raw_tse_consult_candidates | 16.177.96 | Contains all candidates. |
raw_tse_voting_cand_city | 20.954.033 | Contains candidate votes by city. |
raw_tse_voting_party_city | 16.671.05 | Contains party votes by city. |
raw_tse_cand_goods_declared | 31.428.04 | Contains the goods declared candidates. |
cand_info | 14.611.21 | Merge with candidate information. |
You must have it installed on your workstation:
And preferably use a GNU Linux distribution.
Now run the commands below to compile the project:
$ git clone https://github.com/renovabr/electoral-history.git
$ cd electoral-history
Enter pipenv at the root of the electoral-history folder for the tests:
$ pipenv shell
$ pipenv install
The tse-data script downloads the raw data using the year as a parameter. Example to download the year 2020:
$ ./tse-data 2020
You can also download them all using the command:
$ for i in "2010" "2012" "2014" "2016" "2018" "2020"; do ./tse-data ${i}; done
Wait for the data to download! The data is downloaded in the project folder at: data/tse/YEAR
Access the database with user root check if everything is working.
$ mysql -h 127.0.0.1 -u root -p
Then run the sql/schema.sql that is within the project.
$ mysql -h 127.0.0.1 -u root -p < sql/schema.sql
Note: A database will be created called: electoral_historal with three tables and a user with access to a database called winston. The database connection settings can be changed in the config.ini file.
The scripts below import the data from the 4 table:
- consulta-cand -> raw_tse_consult_candidates
- votacao-candidato -> raw_tse_voting_cand_city
- votacao-partido -> raw_tse_voting_party_city
- bem-candidato -> raw_tse_cand_goods_declared
The tables with indication raw_ are the raw data. At the end a merge of all tables is created, this final result is available in the cand_info table.
To import the candidates data for all years, use the script: run-import-all.
$ ./01-run-import-all 'consulta-cand'
If you want to import a specific year and state you can use the command:
$ cd src
$ ./01-consulta-cand.py -y 2020 \
-s 'AC' -p ../data/tse/2020/consulta_cand_2020/ -e csv
Or import the year all from all states:
$ cd src
$ ./01-consulta-cand.py -y 2020 \
-p ../data/tse/2020/consulta_cand_2020/ -e csv
Note: The years (2010 and 2012) the extension is txt, the others are csv.
The same procedure applies to the other tables. To import all candidates voting data, use the command:
$ ./01-run-import-all 'votacao-candidato'
If you want to import a specific year and state you can use the command:
$ cd src
$ ./02-votacao-candidato-munzona.py -y 2020 \
-s 'AC' -p ../data/tse/2020/votacao_candidato_munzona_2020/ -e csv
Or import the year all from all states:
$ cd src
$ ./02-votacao-candidato-munzona.py -y 2020 \
-p ../data/tse/2020/votacao_candidato_munzona_2020/ -e csv
Note: The years (2010 and 2012) the extension is txt, the others are csv.
For the import of votes by party:
$ ./01-run-import-all 'votacao-partido'
If you want to import a specific year and state you can use the command:
$ cd src
$ ./03-votacao-partido-munzona.py -y 2020 \
-s 'AC' -p ../data/tse/2020/votacao_partido_munzona_2020/ -e csv
Or import the year all from all states:
$ cd src
$ ./03-votacao-partido-munzona.py -y 2020 \
-p ../data/tse/2020/votacao_partido_munzona_2020/ -e csv
Note: The years (2010 and 2012) the extension is txt, the others are csv.
To import candidates' declared goods:
$ ./01-run-import-all 'bem-candidato'
If you want to import a specific year and state you can use the command:
$ cd src
$ ./04-bens-candidato.py -y 2020 \
-s 'AC' -p ../data/tse/2020/bem_candidato_2020/ -e csv
Or import the year all from all states:
$ cd src
$ ./04-bens-candidato.py -y 2020 \
-p ../data/tse/2020/bem_candidato_2020/ -e csv
Note: The years (2010 and 2012) the extension is txt, the others are csv.
This table is a junction of candidate data with the voting base.
For the import of all years:
$ ./02-run-cand-info
Note: The script automatically imports data from the shift 1 and after the shift 2.
If you want a specific year, shift 1:
$ cd src
$ ./05-cand-info.py -y 2020 -t 1
Same year for shift 2:
$ cd src
$ ./05-cand-info.py -y 2020 -t 2
If you want to listen to only one state and save it in a CSV file, use the command:
$ cd src
$ ./05-cand-info.py -y 2020 -t 1 -s 'AC' -e ac.csv
Or to process every year use the command:
$ cd src
$ for i in "2010" "2012" "2014" "2016" "2018" "2020"; do ./04-cand-info.py -y ${i} -t 1; done
$ for i in "2010" "2012" "2014" "2016" "2018" "2020"; do ./04-cand-info.py -y ${i} -t 2; done
The script must be executed per year and the results are saved in table cand_info
To process the number of votes and other information for vice-mayor positions, it is necessary to use the script:
$ ./03-run-cand-info-vices
Note: The script automatically imports data from the shift 1 and after the shift 2.
If you want a specific year, shift 1:
$ cd src
$ ./05-cand-info-vices.py -y 2020 -t 1
Same year for shift 2:
$ cd src
$ ./05-cand-info-vices.py -y 2020 -t 2
There is a data dictionary that can be downloaded here: Dictionary.
All tables are partitioned by the year of election.
SELECT COUNT(1) FROM raw_tse_consult_candidates PARTITION (p2020);
Checking positions in the 2016 election:
SELECT
ds_position
FROM
raw_tse_voting_cand_city
WHERE
election_year = '2016'
GROUP BY
1
ORDER BY
1;
2016 |
---|
Prefeito |
Vereador |
Checking positions in the 2018 election:
SELECT
ds_position
FROM
raw_tse_voting_cand_city
WHERE
election_year = '2018'
GROUP BY
1
ORDER BY
1;
2018 |
---|
Deputado Distrital |
Deputado Estadual |
Deputado Federal |
Governador |
Senador |
Checking the total number of votes of the Governors of the state of Santa Catarina in the city of Florianópolis in the first shift of the 2018 elections.
SELECT
sq_candidate AS SQ,
nm_ballot_candidate AS Name,
ds_position AS Position,
nm_city AS City,
format(sum(qt_votes_nominal), 0, 'de_DE') AS Votes
FROM
raw_tse_voting_cand_city
WHERE
election_year = '2018'
AND sg_uf = 'SC'
AND cd_city = 81051
AND nr_shift = 1
AND cd_position = 3
GROUP BY
1,
2,
3,
4
ORDER BY
sum(qt_votes_nominal) DESC;
SQ | Name | Position | City | Votes |
---|---|---|---|---|
240000609724 | COMANDANTE MOISÉS | Governador | FLORIANÓPOLIS | 73.947 |
240000621321 | GELSON MERÍSIO | Governador | FLORIANÓPOLIS | 59.524 |
240000609537 | MAURO MARIANI | Governador | FLORIANÓPOLIS | 43.796 |
240000624336 | DÉCIO LIMA | Governador | FLORIANÓPOLIS | 39.144 |
240000601841 | CAMASÃO | Governador | FLORIANÓPOLIS | 19.362 |
240000616318 | PORTANOVA | Governador | FLORIANÓPOLIS | 4.844 |
240000610038 | INGRID ASSIS | Governador | FLORIANÓPOLIS | 1.644 |
240000614244 | JESSE PEREIRA | Governador | FLORIANÓPOLIS | 1.281 |
Checking the total number of votes of the Governors of the State of Santa Catarina in the second shit of the 2018 elections.
SELECT
sq_candidate AS SQ,
nm_ballot_candidate AS Name,
ds_position AS Position,
format(sum(qt_votes_nominal), 0, 'de_DE') AS Votes
FROM
raw_tse_voting_cand_city
WHERE
election_year = '2018'
AND sg_uf = 'SC'
AND nr_shift = 2
AND cd_position = 3
GROUP BY
1,
2,
3
ORDER BY
sum(qt_votes_nominal) DESC;
SQ | Name | Position | Votes |
---|---|---|---|
240000609724 | COMANDANTE MOISÉS | Governador | 2.644.179 |
240000621321 | GELSON MERÍSIO | Governador | 1.075.242 |
Checking the total number of votes of the Governors of the State of São Paulo in the first shift of the 2018 elections.
SELECT
sq_candidate AS SQ,
nm_ballot_candidate AS Name,
ds_position AS Position,
format(sum(qt_votes_nominal), 0, 'de_DE') AS Votes
FROM
raw_tse_voting_cand_city
WHERE
election_year = '2018'
AND sg_uf = 'SP'
AND nr_shift = 1
AND cd_position = 3
GROUP BY
1,
2,
3
ORDER BY
sum(qt_votes_nominal) DESC;
SQ | Name | Position | Votes |
---|---|---|---|
250000612596 | JOÃO DORIA | Governador | 6.431.555 |
250000615141 | MARCIO FRANÇA | Governador | 4.358.998 |
250000604077 | PAULO SKAF | Governador | 4.269.865 |
250000623884 | LUIZ MARINHO | Governador | 2.563.922 |
250000612133 | MAJOR COSTA E SILVA | Governador | 747.462 |
250000601939 | ROGERIO CHEQUER | Governador | 673.102 |
250000615464 | RODRIGO TAVARES | Governador | 649.729 |
250000601522 | PROFESSORA LISETE | Governador | 507.236 |
250000617766 | PROF. CLAUDIO FERNANDO | Governador | 28.666 |
250000609174 | TONINHO FERREIRA | Governador | 16.202 |
Which top 10 city has the most votes for a distinguished candidate example the candidate (250000612596) to governor of the state of São Paulo, for the second shift.
SELECT
nm_city AS City,
sum(qt_votes_nominal) AS Votes
FROM
raw_tse_voting_cand_city
WHERE
election_year = '2018'
AND sg_uf = 'SP'
AND nr_shift = 2
AND cd_position = 3
AND sq_candidate = 250000612596
GROUP BY
1
ORDER BY
2 DESC LIMIT 10;
City | Votes |
---|---|
SÃO PAULO | 2447309 |
CAMPINAS | 315524 |
GUARULHOS | 240825 |
SÃO JOSÉ DOS CAMPOS | 232775 |
SOROCABA | 207470 |
SANTO ANDRÉ | 202125 |
SÃO BERNARDO DO CAMPO | 196202 |
OSASCO | 176109 |
RIBEIRÃO PRETO | 166728 |
JUNDIAÍ | 143028 |
Checking the total amount of mayors of Rio Branco in Acre for the 2º shift (2020 year).
SELECT
sq_candidate AS SQ,
nm_ballot_candidate AS Name,
cd_position AS CodePosition,
ds_position AS Position,
nm_city AS City,
format(sum(qt_votes_nominal), 0, 'de_DE') AS Votes
FROM
raw_tse_voting_cand_city
WHERE
election_year = '2020'
AND sg_uf = 'AC'
AND cd_city = 1392
AND nr_shift = 2
AND cd_position = 11
GROUP BY
1,
2,
3,
4,
5
ORDER BY
sum(qt_votes_nominal) DESC;
SQ | Name | CodePosition | Position | Votes |
---|---|---|---|---|
10001003637 | TIÃO BOCALOM | 11 | Prefeito | 104.746 |
10001154203 | SOCORRO NERI | 11 | Prefeito | 61.702 |
Checks all elected councilors of Rio Branco in Acre:
SELECT
sq_candidate AS SQ,
nm_ballot_candidate AS Name,
cd_position AS CodePosition,
ds_position AS Position,
ds_situ_tot_shift AS Situation,
nm_city AS City,
format(sum(qt_votes_nominal), 0, 'de_DE') AS Votes
FROM
raw_tse_voting_cand_city
WHERE
election_year = '2020'
AND sg_uf = 'AC'
AND cd_city = 1392
AND nr_shift = 1
AND cd_position = 13
AND ((cd_situ_tot_shift = 1) OR (cd_situ_tot_shift = 2) OR (cd_situ_tot_shift = 3))
GROUP BY
1,
2,
3,
4,
5,
6
ORDER BY
sum(qt_votes_nominal) DESC;
SQ | Name | CodePosition | Position | Situation | City | Votes |
---|---|---|---|---|---|---|
10000851435 | DRA. MICHELLE MELO | 13 | Vereador | ELEITO POR QP | RIO BRANCO | 3.576 |
10000848795 | SAMIR BESTENE | 13 | Vereador | ELEITO POR QP | RIO BRANCO | 3.403 |
10000969481 | ANTONIO MORAIS | 13 | Vereador | ELEITO POR QP | RIO BRANCO | 2.933 |
10000699918 | EMERSON JARUDE | 13 | Vereador | ELEITO POR QP | RIO BRANCO | 2.765 |
10000848788 | VER. CAP. N LIMA' | 13 | Vereador | ELEITO POR MÉDIA | RIO BRANCO | 2.559 |
10000969483 | RAIMUNDO NENÉM | 13 | Vereador | ELEITO POR MÉDIA | RIO BRANCO | 2.555 |
10000851433 | JOAQUIM FLORÊNCIO | 13 | Vereador | ELEITO POR MÉDIA | RIO BRANCO | 2.411 |
10000848803 | RUTÊNIO SÁ | 13 | Vereador | ELEITO POR MÉDIA | RIO BRANCO | 2.271 |
10000969498 | ADAILTON CRUZ | 13 | Vereador | ELEITO POR MÉDIA | RIO BRANCO | 2.222 |
10000968150 | ARNALDO BARROS | 13 | Vereador | ELEITO POR MÉDIA | RIO BRANCO | 2.132 |
10000851443 | FÁBIO ARAÚJO | 13 | Vereador | ELEITO POR MÉDIA | RIO BRANCO | 2.086 |
10001014911 | ISMAEL MACHADO | 13 | Vereador | ELEITO POR QP | RIO BRANCO | 1.928 |
10000670251 | LENE PETECÃO | 13 | Vereador | ELEITO POR QP | RIO BRANCO | 1.789 |
10000701028 | FRANCISCO PIABA | 13 | Vereador | ELEITO POR MÉDIA | RIO BRANCO | 1.659 |
10000699922 | CÉLIO GADELHA | 13 | Vereador | ELEITO POR MÉDIA | RIO BRANCO | 1.293 |
10001014897 | DR RAIMUNDO CASTRO | 13 | Vereador | ELEITO POR MÉDIA | RIO BRANCO | 1.268 |
10001079047 | HILDEGARD PASCOAL | 13 | Vereador | ELEITO POR MÉDIA | RIO BRANCO | 1.130 |